step 0 cost: min 33.05 avg 47.01 max 53.67

std dev - 2.65 avg_distance - 0.00 
cheat = 990.00 (min = 970.00, max = 1000.00) 
ranko = 38.05 yugo = 31.68 wire = 6.30 
step - 0 cost - 38.046944
MMMMMMMMMCCCCCCR 

step 1 cost: min 33.59 avg 43.56 max 50.61
std dev - 3.29 avg_distance - 98.98 
cheat = 970.00 (min = 950.00, max = 1000.00) 
ranko = 33.59 yugo = 31.34 wire = 5.75 
MMMMMMMMMCCCCCCR 

step 2 cost: min 29.10 avg 39.48 max 48.95 
std dev - 4.15 avg_distance - 96.76
cheat = 980.00 (min = 950.00, max = 1000.00)
ranko = 29.10 yugo = 31.59 wire = 5.14
MMMMMMMMMCCCCCCR 

step 3 cost: min 26.90 avg 36.16 max 48.16 
std dev - 4.67 avg_distance - 97.94 
cheat = 980.00 (min = 920.00, max = 1000.00) 
ranko = 26.90 yugo = 31.52 wire = 4.95 
MMMMMMMMMCCCCCCR 

step 4 cost: min 23.26 avg 32.59 max 46.99 
std dev - 4.67 avg_distance - 95.21
cheat = 950.00 (min = 910.00, max = 1000.00)
ranko = 23.26 yugo = 32.27 wire = 4.50
MMMMMMMMMCCCCCCR 

step 5 cost: min 21.54 avg 29.38 max 43.33 
std dev - 4.09 avg_distance - 90.03
cheat = 980.00 (min = 910.00, max = 1000.00)
ranko = 21.54 yugo = 31.76 wire = 4.26
step - 5 cost - 21.543444
MMMMMMMMMCCCCCCR 

step 6 cost: min 19.19 avg 27.21 max 40.92 
std dev - 4.04 avg_distance - 82.35
cheat = 980.00 (min = 920.00, max = 1000.00)
ranko = 19.19 yugo = 31.75 wire = 4.01
MMMMMMMMMCCCCCCR

step 7 cost: min 18.43 avg 25.20 max 40.36
std dev - 3.83 avg_distance - 83.71 
cheat = 980.00 (min = 880.00, max = 1000.00) 
ranko = 18.43 yugo = 31.71 wire = 3.92 
MMMMMMMMMCCCCCCR 

step 8 cost: min 18.43 avg 23.02 max 35.82 
std dev - 3.18 avg_distance - 89.51 
cheat = 980.00 (min = 860.00, max = 1000.00) 
ranko = 18.43 yugo = 31.71 wire = 3.92 
MMMMMMMMMCCCCCCR 

step 9 cost: min 17.27 avg 22.08 max 38.14 
std dev - 3.36 avg_distance - 89.64 
cheat = 890.00 (min = 850.00, max = 990.00) 
ranko = 17.27 yugo = 31.01 wire = 3.63 
MMMMMMMMMCCCCCCR 

step 10 cost: min 16.13 avg 20.99 max 34.48
std dev - 3.09 avg_distance - 88.60
cheat = 850.00 (min = 850.00, max = 990.00)
ranko = 16.13 yugo = 31.01 wire = 3.50
step - 10 cost - 16.125445
MMMMMMMMMCCCCCCR 

step 11 cost: min 15.73 avg 19.84 max 34.48 
std dev - 3.02 avg_distance - 88.60 
cheat = 830.00 (min = 820.00, max = 990.00) 
ranko = 15.73 yugo = 31.20 wire = 3.34 
MMMMMMMMMCCCCCCR 

step 12 cost: min 15.03 avg 18.81 max 29.43
std dev - 2.42 avg_distance - 88.25
cheat = 830.00 (min = 780.00, max = 910.00) 
ranko = 15.03 yugo = 31.05 wire = 3.34 
MMMMMMMMMCCCCCCR 

step 13 cost: min 14.61 avg 17.77 max 35.13
std dev - 2.61 avg_distance - 87.00 
cheat = 830.00 (min = 770.00, max = 910.00) 
ranko = 14.61 yugo = 31.27 wire = 3.20 
MMMMMMMMMCCCCCCR 

step 14 cost: min 13.22 avg 15.77 max 24.76
std dev - 2.28 avg_distance - 85.22
cheat = 820.00 (min = 790.00, max = 970.00)
ranko = 13.22 yugo = 30.91 wire = 3.02
MMMMMMMMMCCCCCCR 

step 15 cost: min 12.22 avg 15.91 max 22.79
std dev - 2.12 avg_distance - 83.45
cheat = 800.00 (min = 780.00, max = 960.00)
ranko = 12.22 yugo = 29.93 wire = 32.51
step - 15 cost - 12.219247
MMMMMMMMMCCCCCCR 

step 16 cost: min 11.51 avg 15.48 max 26.64
std dev - 2.57 avg_distance - 85.07
cheat = 840.00 (min = 720.00, max = 930.00)
ranko = 11.51 yugo = 30.05 wire = 2.62
MMMMMMMMMCCCCCCR

step 17 cost: min 11.10 avg 14.34 max 22.19
std dev - 2.00 avg_distance - 80.10
cheat = 680.00 (min = 680.00, max = 920.00) 
ranko = 11.10 yugo = 29.55 wire = 2.58 
MMMMMMMMMCCCCCCR 

step 18 cost: min 10.39 avg 14.32 max 23.80
std dev - 2.39 avg_distance - 87.77
cheat = 690.00 (min = 670.00, max = 930.00) 
ranko = 10.39 yugo = 28.34 wire = 2.38 
MMMMMMMMMCCCCCCR 
Switching Cost Function to WIRE 

step 19 cost: min 2.33 avg 2.94 max 4.04 

std dev - 0.39 avg_distance - 87.00 
cheat = 730.00 (min = 660.00, max = 910.00) 
ranko = 9.94 yugo = 28.77 wire = 2.33 
MMMMMMMMMCCCCCCR 

step 20 cost: min 2.24 avg 2.82 max 4.86 

std dev - 0.45 avg_distance - 83.97
cheat = 700.00 (min = 640.00, max = 920.00)
ranko = 9.64 yugo = 28.67 wire = 2.24
step - 20 cost - 2.236364
MMMMMMMMMCCCCCCR 

step 21 cost: min 2.21 avg 2.68 max 4.21 

std dev - 0.30 avg_distance - 81.75 
cheat = 670.00 (min = 630.00, max = 870.00) 
ranko = 9.47 yugo = 28.34 wire = 2.21 
MMMMMMMMMCCCCCCR 

step 22 cost: min 2.15 avg 2.57 max 3.58 

std dev - 0.23 avg_distance - 80.25
cheat = 670.00 (min = 610.00, max = 860.00) 
ranko = 8.93 yugo = 28.25 wire = 2.15 
MMMMMMMMMCCCCCCR 

step 23 cost: min 2.05 avg 2.51 max 3.98 

std dev - 0.34 avg_distance - 79.18 
cheat = 630.00 (min = 480.00, max = 860.00) 
ranko = 8.66 yugo = 27.16 wire = 2.05 
MMMMMMMMMCCCCCCR 

step 24 cost: min 1.92 avg 2.38 max 3.25 

std dev - 0.22 avg_distance - 73.32
cheat = 600.00 (min = 460.00, max = 820.00) 
ranko = 8.02 yugo = 27.70 wire = 1.92 
MMMMMMMMMCCCCCCR 

step 25 cost: min 1.92 avg 2.29 max 3.00 

std dev - 0.19 avg_distance - 73.17
cheat = 600.00 (min = 460.00, max = 890.00)
ranko = 8.02 yugo = 27.70 wire = 1.92
step - 25 cost - 1.918182
MMMMMMMMMCCCCCCR 

step 26 cost: min 1.82 avg 2.22 max 3.22 

std dev - 0.23 avg_distance - 69.91 
cheat = 500.00 (min = 460.00, max = 750.00) 
ranko = 7.71 yugo = 27.10 wire = 1.82

step 27 cost: min 1.79 avg 2.22 max 2.68

std dev - 0.18 avg_distance - 66.09
cheat = 400.00 (min = 400.00, max = 800.00)
ranko = 8.05 yugo = 36.11 wire = 1.79
MMMMMMMMMCCCCCCR 

step 28 cost: min 1.64 avg 2.08 max 3.01 

std dev - 0.25 avg_distance - 64.42 
cheat = 380.00 (min = 380.00, max = 750.00) 
ranko = 7.11 yugo = 25.69 wire = 1.64
MMMMMMMMMCCCCCCR 

step 29 cost: min 1.64 avg 2.02 max 2.88 

std dev - 0.23 avg_distance - 64.29
cheat = 380.00 (min = 380.00, max = 700.00)
ranko = 7.11 yugo = 25.69 wire = 1.64
MMMMMMMMMCCCCCCR 

step 30 cost: min 1.53 avg 1.95 max 2.71 

std dev - 0.21 avg_distance - 62.90 
cheat = 350.00 (min = 300.00, max = 770.00) 
ranko = 6.41 yugo = 25.22 wire = 1.52 
step - 30 cost - 1.518182
MMMMMMMMMCCCCCCR 

step 31 cost: min 1.50 avg 1.89 max 2.98 

std dev - 0.25 avg_distance - 58.00 
cheat = 350.00 (min = 300.00, max = 890.00) 
ranko = 6.52 yugo = 24.83 wire = 1.50 
MMMMMMMMMCCCCCCR 

step 32 cost: min 1.48 avg 1.65 max 2.66 

std dev - 0.24 avg_distance - 58.53 
cheat = 270.00 (min = 270.00, max = 590.00) 
ranko = 5.41 yugo = 24.71 wire = 1.48 
MMMMMMMMMCCCCCCR 

step 33 cost: min 1.42 avg 1.78 max 3.70 

std dev - 0.30 avg_distance - 56.60 
cheat = 260.00 (min = 220.00, max = 820.00) 
ranko = 6.10 yugo = 24.62 wire = 1.42 
MMMMMMMMMCCCCCCR 

step 34 cost: min 1.35 avg 1.70 max 2.35 

std dev - 0.20 avg_distance - 53.01
cheat = 210.00 (min = 170.00, max = 680.00) 
ranko = 5.63 yugo = 24.47 wire = 1.35 
MMMMMMMMMCCCCCCR 

step 35 cost: min 1.34 avg 1.65 max 2.64

std dev - 0.22 avg_distance - 50.26 
cheat = 210.00 (min = 190.00, max = 560.00) 
ranko = 5.57 yugo = 24.23 wire = 1.34 
step - 35 cost - 1.336364
MMMMMMMMMCCCCCCR 

step 36 cost: min 1.28 avg 1.57 max 3.39 

std dev - 0.22 avg_distance - 45.58
cheat = 160.00 (min = 160.00, max = 840.00) 
ranko = 5.48 yugo = 23.79 wire = 1.38 
MMMMMMMMMCCCCCCR

step 37 cost: min 1.22 avg 1.52 max 2.05

std dev - 0.14 avg_distance - 41.10
cheat = 110.00 (min = 110.00, max = 550.00) 
ranko = 5.37 yugo = 23.22 wire = 1.22 
MMMMMMMMMCCCCCCR 

step 38 cost: min 1.20 avg 1.46 max 1.88

std dev - 0.15 avg_distance - 39.63
cheat = 110.00 (min = 110.00, max = 440.00) 
ranko = 4.99 yugo = 23.21 wire = 1.20 
MMMMMMMMMCCCCCCR 

step 39 cost: min 1.10 avg 1.42 max 2.53 

std dev - 0.23 avg_distance - 38.48 
cheat = 70.00 (min = 70.00, max = 700.00) 
ranko = 4.49 yugo = 22.55 wire = 1.10 
MMMMMMMMMCCCCCCR 

step 40 cost: min 1.03 avg 1.31 max 1.76 

std dev - 0.16 avg_distance - 29.52
cheat = 20.00 (min = 20.00, max = 520.00)
ranko = 4.13 yugo = 21.86 wire = 1.03
step - 40 cost - 1.027273
MMMMMMMMMCCCCCCR 

step 41 cost: min 1.00 avg 1.23 max 1.90 

std dev - 0.19 avg_distance - 23.27
cheat = 0.00 (min = 0.00, max = 530.00) 
ranko = 4.00 yugo = 21.51 wire = 1.00 
MMMMMMMMMCCCCCCR 

step 42 cost: min 1.00 avg 1.15 max 1.98 

std dev - 0.17 avg_distance - 16.79
cheat = 0.00 (min = 0.00, max = 690.00) 
ranko = 4.00 yugo = 21.61 wire = 1.00 
MMMMMMMMMCCCCCCR 

step 43 cost: min 1.00 avg 1.11 max 1.75 
std dev - 0.15 avg_distance - 11.71
cheat = 0.00 (min = 0.00, max = 390.00)
ranko = 4.00 yugo = 21.51 wire = 1.00
MMMMMMMMMCCCCCCR

OPTIMIZATION PROCESSING FOR
INTEGRATED CIRCUIT PHYSICAL DESIGN
AUTOMATION SYSTEM USING OPTIMALLY
SWITCHED FITNESS IMPROVEMENT
ALGORITHMS

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention generally relates to the art of
microelectronic circuit fabrication, and more specifically to
an optimization processing for integrated circuit physical
design automation system using optimally switched fitness
improvement algorithms.

2. Description of the Related Art 

CONTENTS 

1. Integrated Circuit (IC) Physical Design

2. Physical Design Algorithms 

a. Overview 

b. Simulated Annealing 

c. Simulated Evolution 

d. Force Directed Placement 

3. Integrated Circuit Cell Placement Representation 

4. Cost Function Computation for IC Physical Design 

5. Parallel Processing Applied to IC Physical Design 

6. Distributed Shared Memory (DSM) Parallel Processing
Architectures

a. Overview 

b. Limitations of Basic DSM Architecture 

c. Telecommunications Network Applications 

1. Integrated Circuit (IC) Physical Design

The automated physical design of a microelectronic inte- 
grated circuit is a specific, preferred example of simulta- 
neous optimization processing using a parallel processing 
architecture to which the present invention is directed. 

Microelectronic integrated circuits consist of a large num- 
ber of electronic components that are fabricated by layering 
several different materials on a silicon base or wafer. The 
design of an integrated circuit transforms a circuit description
into a geometric description which is known as a layout.
A layout consists of a set of planar geometric shapes in 
several layers. 

The layout is then checked to ensure that it meets all of the 
design requirements. The result is a set of design files in a 
particular unambiguous representation known as an intermediate
form that describes the layout. The design files are
then converted into pattern generator files that are used to 
produce patterns called masks by an optical or electron beam 
pattern generator. 

During fabrication, these masks are used to pattern a
silicon wafer using a sequence of photolithographic steps. 
The component formation requires very exacting details 
about geometric patterns and separation between them. The 
process of converting the specifications of an electrical 
circuit into a layout is called the physical design. It is an
extremely tedious and error-prone process because of the
tight tolerance requirements and the minuteness of the 
individual components. 

Currently, the minimum geometric feature size of a
component is on the order of 0.5 microns. However, it is
expected that the feature size can be reduced to 0.1 micron
within several years. This small feature size allows
fabrication of as many as 4.5 million transistors or 1 million
gates of logic on a 25 millimeter by 25 millimeter chip. This
trend is expected to continue, with even smaller feature
geometries and more circuit elements on an integrated circuit,
and of course, larger die (or chip) sizes will allow far greater
numbers of circuit elements.

Due to the large number of components and the exacting 
details required by the fabrication process, physical design 
is not practical without the aid of computers. As a result, 
most phases of physical design extensively use Computer 
Aided Design (CAD) tools, and many phases have already 
been partially or fully automated. Automation of the physi- 
cal design process has increased the level of integration, 
reduced turn around time and enhanced chip performance. 

The objective of physical design is to determine an 
optimal arrangement of devices in a plane or in a three 
dimensional space, and an efficient interconnection or rout- 
ing scheme between the devices to obtain the desired 
functionality. Since space on a wafer is very expensive real 
estate, algorithms must use the space very efficiently to 
lower costs and improve yield. 

Currently available physical design automation systems 
are limited in that they are only capable of placing and 
routing approximately 20,000 devices or cells. Placement of 
larger numbers of cells is accomplished by partitioning the 
cells into blocks of 20,000 or less, and then placing and 
routing the blocks. This expedient is not satisfactory since 
the resulting placement solution is far from optimal. 

An exemplary integrated circuit chip is illustrated in FIG. 
1 and generally designated by the reference numeral 10. The 
circuit 10 includes a semiconductor substrate 12 on which 
are formed a number of functional circuit blocks that can 
have different sizes and shapes. Some are relatively large, 
such as a central processing unit (CPU) 14, a read-only 
memory (ROM) 16, a clock/timing unit 18, one or more 
random access memories (RAM) 20 and an input/output 
(I/O) interface unit 22. These blocks can be considered as 
modules for use in various circuit designs, and are repre- 
sented as standard designs in circuit libraries. 

The integrated circuit 10 further comprises a large 
number, which can be tens of thousands, hundreds of 
thousands or even millions or more of small cells 24. Each 
cell 24 represents a single logic element, such as a gate, or 
several logic elements that are interconnected in a standard- 
ized manner to perform a specific function. Cells 24 that 
consist of two or more interconnected gates or logic ele- 
ments are also available as standard modules in circuit 
libraries. 

The cells 24 and the other elements of the circuit 10 
described above are interconnected or routed in accordance 
with the logical design of the circuit to provide the desired 
functionality. Although not visible in the drawing, the vari- 
ous elements of the circuit 10 are interconnected by elec- 
trically conductive lines or traces that are routed, for 
example, through vertical channels 26 and horizontal chan- 
nels 28 that run between the cells 24. 

The input to the physical design problem is a circuit 
diagram, and the output is the layout of the circuit. This is 
accomplished in several stages including partitioning, floor 
planning, placement, routing and compaction. 

Partitioning: A chip may contain several million transistors.
Layout of the entire circuit cannot be handled due to the
limitation of memory space as well as the computation
power available. Therefore it is normally partitioned by
grouping the components into blocks such as subcircuits and
modules. The actual partitioning process considers many
factors such as the size of the blocks, number of blocks and
number of interconnections between the blocks.

The output of partitioning is a set of blocks, along with the
interconnections required between blocks. The set of
interconnections required is referred to as a netlist. In large
circuits, the partitioning process is often hierarchical,
although non-hierarchical (e.g. flat) processes can be used,
and at the topmost level a circuit can have between 5 and 25
blocks. However, greater numbers of blocks are possible and
contemplated. Each block is then partitioned recursively into
smaller blocks.

Floor planning and placement: This step is concerned
with selecting good layout alternatives for each block of the
entire chip, as well as between blocks and to the edges. Floor
planning is a critical step as it sets up the ground work for
a good layout. However it is computationally quite hard.
Very often the task of floor plan layout is done by a design
engineer using a CAD tool. This is necessary as the major
components of an IC are often intended for specific locations
on the chip.

Only for simple layouts can the current layout tools
provide a solution without human-engineering direction and
intervention. One aspect of the present invention will permit
complex problems, including floor plan layout, to be
accomplished without regular human intervention.

During placement, the blocks are exactly positioned on
the chip. The goal of placement is to find a minimum area
arrangement for the blocks that allows completion of
interconnections between the blocks. Placement is typically done
in two phases. In the first phase, an initial placement is
created. In the second phase, the initial placement is
evaluated and iterative improvements are made until the layout
has minimum area and conforms to design specifications.

The vertical and horizontal channels 26 and 28 are
generally provided between the blocks in order to allow for
electrical interconnections. The quality of the placement will
not be evident until the routing phase has been completed.
A particular placement may lead to an unroutable design.
For example, routing may not be possible in the space
provided. In that case another iteration of placement is
necessary. Sometimes routing is implemented over the entire
area, and not just over the channels.

To limit the number of iterations of the placement
algorithm, an estimate of the required routing space is used
during the placement phase. Good routing and circuit
performance depend heavily on a good placement algorithm.
This is due to the fact that once the position of each block
is fixed, very little can be done to improve the routing and
overall circuit performance.

Routing: The objective of the routing phase is to
complete the interconnections between blocks according to the
specified netlist. First, the space not occupied by blocks,
which is called the routing space, is partitioned into
rectangular regions called channels and switch boxes. The goal of
a router is to complete all circuit connections using the
shortest possible wire length and using only the channels and
switch boxes.

Routing is usually done in two phases referred to as the
global routing and detailed routing phases. In global routing,
connections are completed between the proper blocks of the
circuit disregarding the exact geometric details of each wire
and terminal. For each wire, a global router finds a list of
channels that are to be used as a passageway for that wire.
In other words, global routing specifies the loose route of a
wire through different regions of the routing space.

Global routing is followed by detailed routing which
completes point-to-point connections between terminals on
the blocks. Loose routing is converted into exact routing by
specifying the geometric information such as width of wires
and their layer assignments. Detailed routing includes
channel routing and switch box routing.

Due to the nature of the routing algorithms, complete
routing of all connections cannot be guaranteed in many
cases. As a result, a technique called "rip up and re-route" is
used that removes troublesome connections and re-routes
them in a different order.

Compaction: Compaction is the task of compressing the
layout in all directions such that the total area is reduced. By
making the chips smaller, wire lengths are reduced which in
turn reduces the signal delay between components of the
circuit. At the same time a smaller area enables more chips
to be produced on a wafer which in turn reduces the cost of
manufacturing. Compaction must ensure that no rules
regarding the design and fabrication process are violated.

VLSI physical design is iterative in nature and many steps
such as global routing and channel routing are repeated
several times to obtain a better layout. In addition, the
quality of results obtained in one stage depends on the
quality of solution obtained in earlier stages as discussed
above. For example, a poor quality placement cannot be
fully cured by high quality routing. As a result, earlier steps
have extensive influence on the overall quality of the
solution.

In this sense, partitioning, floor planning and placement
problems play a more important role in determining the area
and chip performance in comparison to routing and
compaction. Since placement may produce an unroutable layout,
the chip might need to be re-placed or re-partitioned before
another routing is attempted. The whole design cycle is
conventionally repeated several times to accomplish the
design objectives. The complexity of each step varies
depending on the design constraints as well as the design
style used.

The area of the physical design problem to which the
present invention relates is the placement and routing of
the cells 24 and other elements on the integrated
circuit 10 illustrated in FIG. 1. After the circuit partitioning
phase, the area occupied by each block including the
elements designated as 14 to 22 and the cells 24 can be
calculated, and the number of terminals required by each
block is known. In addition, the netlists specifying the
connections between the blocks are also specified.

In order to complete the layout, it is necessary to arrange
the blocks on the layout surface and interconnect their
terminals according to the netlist. The arrangement of blocks
is done in the placement phase while interconnection is
completed in the routing phase. In the placement phase, the
blocks are assigned a specific shape and are positioned on a
layout surface in such a fashion that no two blocks are
overlapping and enough space is left on the layout surface to
complete interconnections between the blocks. The blocks
are positioned so as to minimize the total area of the layout.
In addition, the locations of the terminals on each block are
also determined.

2. Physical Design Algorithms

a. Overview

Very Large Scale Integrated Circuit (VLSI) physical
design automation utilizes algorithms and data structures
related to the physical design process. A general treatise on
this art is presented in a textbook entitled "Algorithms for
VLSI Physical Design Automation" by Naveed Sherwani,
Kluwer Academic Publishers 1993.
Depending on the input, placement algorithms can be
classified into two major groups, constructive placement and
iterative improvement methods. The input to the
constructive placement algorithms consists of a set of blocks along
with the netlist. The algorithm finds the locations of the
blocks. On the other hand, iterative improvement algorithms
start with an initial placement. These algorithms modify the
initial placement in search of a better placement. The
algorithms are applied in a recursive or an iterative manner
until no further improvement is possible, or the solution is
considered to be satisfactory based on predetermined
criteria.

Iterative algorithms can be divided into three general
classifications, simulated annealing, simulated evolution and
force directed placement. The simulated annealing
algorithm simulates the annealing process that is used to temper
metals. Simulated evolution simulates the biological process
of evolution, while the force directed placement simulates a
system of bodies attached by springs.

Assuming that a number N of cells are to be optimally
arranged and routed on an integrated circuit chip, the
number of different ways that the cells can be arranged on the
chip, or the number of permutations, is equal to N! (N
factorial). In the following description, each arrangement of
cells will be referred to as a placement. In a practical
integrated circuit chip, the number of cells can be hundreds
of thousands or millions. Thus, the number of possible
placements is extremely large.

Iterative algorithms function by generating large
numbers of possible placements and comparing them in
accordance with some criteria which is generally referred to as
fitness. The fitness of a placement can be measured in a
number of different ways, for example, overall chip size. A
small size is associated with a high fitness and vice versa.
Another measure of fitness is the total wire length of the
integrated circuit. A high total wire length indicates low
fitness and vice versa.

The relative desirability of various placement
configurations can alternatively be expressed in terms of cost, which
can be considered as the inverse of fitness, with high cost
corresponding to low fitness and vice versa.

b. Simulated Annealing

Basic simulated annealing per se is well known in the art
and has been successfully used in many phases of VLSI
physical design such as circuit partitioning. Simulated
annealing is used in placement as an iterative improvement
algorithm. Given a placement configuration, a change to that
configuration is made by moving a component or
interchanging locations of two components. Such interchange
can be alternatively expressed as transposition or swapping.

In the case of a simple pairwise interchange algorithm, it
is possible that a configuration achieved has a cost higher
than that of the optimum, but no interchange can cause
further cost reduction. In such a situation, the algorithm is
trapped at a local optimum and cannot proceed further. This
happens quite often when the algorithm is used in practical
applications. Simulated annealing helps to avoid getting
stuck at a local optimum by occasionally accepting moves that
result in a cost increase.

In simulated annealing, all moves that result in a decrease
in cost are accepted. Moves that result in an increase in cost
are accepted with a probability that decreases over the
iterations. The analogy to the actual annealing process is
heightened with the use of a parameter called temperature T.
This parameter controls the probability of accepting moves
that result in increased cost.

More of such moves are accepted at higher values of
temperature than at lower values. The algorithm starts with
a very high value of temperature that gradually decreases so
that moves that increase cost have a progressively lower
probability of being accepted. Finally, the temperature
reduces to a very low value which requires that only moves
that reduce costs are to be accepted. In this way, the
algorithm converges to an optimal or near optimal
configuration.

In each stage, the placement is shuffled randomly to get a
new placement. This random shuffling could be achieved by
moving a cell to a random location, a transposition of
two cells, or any other move that can change the wire length
or other criteria. After the shuffle, the change in cost is
evaluated. If there is a decrease in cost, the configuration is
accepted. Otherwise, the new configuration is accepted with
a probability that depends on the temperature.

The temperature is then lowered using some function
which, for example, could be exponential in nature. The
process is stopped when the temperature is dropped to a
certain level. A number of variations and improvements on
the basic simulated annealing algorithm have been
developed. An example is described in an article entitled
"Timberwolf 3.2 A New Standard Cell Placement and Global
Routing Package" by Carl Sechen, et al., IEEE 23rd
Design Automation Conference paper 26.1, pages 432 to
439.

c. Simulated Evolution

Simulated evolution, which is also known as the genetic
algorithm, is analogous to the natural process of mutation of
species as they evolve to better adapt to their environment.
The algorithm starts with an initial set of placement
configurations which is called the population. The initial
placement can be generated randomly. Each individual in the
population represents a feasible placement to the
optimization problem and is actually represented by a string of
symbols.

The symbols used in the solution string are called genes.
A solution string made up of genes is called a chromosome.
A schema is a set of genes that make up a partial solution.
The simulated evolution or genetic algorithm is iterated, and
each iteration is called a generation. During each iteration,
the individual placements of the population are evaluated on
the basis of fitness or cost. Two individual placements
among the population are selected as parents, with
probabilities based on their fitness. The better fitness a placement
has, the higher the probability that it will be chosen.

The genetic operators called crossover, mutation and
inversion, which are analogous to their counterparts in the
evolution process, are applied to the parents to combine
genes from each parent to generate a new individual called
the offspring or child. The offspring are evaluated, and a new
generation is formed by including some of the parents and
the offspring on the basis of their fitness in a manner such
that the size of the population remains the same. As the
tendency is to select high fitness individuals to generate
offspring, and the weak individuals are deleted, the next
generation tends to have individuals that have good fitness.

The fitness of the entire population improves over the
generations. That means that the overall placement quality
improves over iterations. At the same time, some low fitness
individuals are reproduced from previous generations to
maintain diversity even though the probability of doing so is
quite low. In this way, it is assured that the algorithm does
not get stuck at some local optimum.

The first main operator of the genetic algorithm is
crossover, which generates offspring by combining
schemata of two individuals at a time. This can be achieved by
choosing a random cut point and generating the offspring by
combining the left segment of one parent with the right
segment of the other. However, after doing so, some cells
may be duplicated while other cells are deleted. This
problem will be described in detail below.
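The duplication problem can be seen in a minimal sketch of cut-point crossover. The sketch assumes a placement is encoded as a list in which position i holds the cell assigned to slot i; that encoding, the function names and the two example parents are hypothetical illustrations and are not taken from this disclosure.

```python
def cut_point_crossover(parent_a, parent_b, cut):
    """Join the left segment of parent_a to the right segment of parent_b."""
    return parent_a[:cut] + parent_b[cut:]


def duplicated_and_missing(child, all_cells):
    """Report cells that occur more than once and cells that were lost."""
    duplicated = sorted({c for c in child if child.count(c) > 1})
    missing = sorted(set(all_cells) - set(child))
    return duplicated, missing


# Two hypothetical parent placements of the same six cells.
parent_a = ["A", "B", "C", "D", "E", "F"]
parent_b = ["C", "F", "A", "E", "B", "D"]

child = cut_point_crossover(parent_a, parent_b, cut=3)
duplicated, missing = duplicated_and_missing(child, parent_a)
print(child)       # ['A', 'B', 'C', 'E', 'B', 'D']
print(duplicated)  # ['B']
print(missing)     # ['F']
```

With the cut after the third slot, cell B appears twice in the child and cell F is lost, so the child is not a valid placement until it is repaired.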

The amount of crossover is controlled by the crossover
rate, which is defined as the ratio of the number of offspring
produced by crossing in each generation to the population
size. Crossover attempts to create offspring with fitness
higher than either parent by combining the best genes from
each.

Mutation creates incremental random changes. The most
commonly used mutation is pairwise interchange or
transposition. This is the process by which new genes that did
not exist in the original generation, or have been lost, can be
generated.

The mutation rate is defined as the ratio of the number of
offspring produced by mutation in each generation to the
population size. It must be carefully chosen because while it
can introduce more useful genes, most mutations are harmful
and reduce fitness. The primary application of mutation
is to pull the algorithm out of local optima.
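As a rough illustration of pairwise interchange and of the mutation rate as defined above, the following sketch swaps two randomly chosen slots of a placement and produces a number of mutated offspring equal to the rate times the population size. The list encoding, the function names and the truncation to a whole number of offspring are assumptions made for the example.

```python
import random


def pairwise_interchange(placement, rng):
    """Return a copy of the placement with two distinct slots swapped."""
    i, j = rng.sample(range(len(placement)), 2)
    mutated = list(placement)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def mutate_population(population, mutation_rate, rng):
    """Create int(mutation_rate * population size) mutated offspring.

    Truncating to a whole number is an assumption of this sketch.
    """
    count = int(mutation_rate * len(population))
    parents = rng.sample(population, count)
    return [pairwise_interchange(parent, rng) for parent in parents]


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [list("ABCDEF") for _ in range(10)]
offspring = mutate_population(population, mutation_rate=0.2, rng=rng)
print(len(offspring))  # 2
```

Because a swap only exchanges two cells, every mutated offspring is still a valid placement, which is one reason pairwise interchange is a convenient mutation for this problem.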

Inversion is an operator that changes the representation of
a placement without actually changing the placement itself
so that an offspring is more likely to inherit certain schemata
from one parent.

After the offspring are generated, individual placements 
for the next generation are chosen based on some criteria. 
Numerous selection criteria are available, such as total chip 
size and wire length as described above. In competitive 
selection, all the parents and offspring compete with each
other, and the fittest placements are selected so that the 
population remains constant. In random selection, the place- 
ments for the next generation are randomly selected so that 
the population remains constant. 

The latter criterion is often advantageous considering the
fact that by selecting the fittest individuals, the population
converges to individuals that share the same genes and the
search may not converge to an optimum. However, if the
individuals are chosen randomly there is no way to gain
improvement from an older generation to a new generation. By
combining both methods, stochastic selection makes selections
with probabilities based on the fitness of each individual.
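
The three selection schemes described above can be sketched as follows. This is only an illustration: the function names are hypothetical, fitness is assumed to be a positive number where higher is better, and drawing without replacement is an assumption made so that the population size stays constant.

```python
import random

def competitive_selection(population, fitness, size):
    """Parents and offspring compete; keep the `size` fittest."""
    return sorted(population, key=fitness, reverse=True)[:size]

def random_selection(population, size, rng=random):
    """Keep `size` individuals chosen uniformly at random."""
    return rng.sample(population, size)

def stochastic_selection(population, fitness, size, rng=random):
    """Keep `size` individuals, each drawn with probability
    proportional to its fitness (without replacement)."""
    pool = list(population)
    chosen = []
    for _ in range(size):
        weights = [fitness(p) for p in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
    return chosen
```

Stochastic selection sits between the other two: fit individuals are favored, but less fit ones still have a chance to survive and preserve diversity.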

d. Force Directed Placement 

Force directed placement exploits the similarity between
the placement problem and the classical mechanics problem
of a system of bodies attached to springs. In this method, the
blocks connected to each other by nets are supposed to exert
attractive forces on each other. The magnitude of this force
is directly proportional to the distance between the blocks.
Additional proportionality is achieved by connecting more 
"springs" between blocks that "talk" to each other more 
(volume, frequency, etc.) and fewer "springs" where less 
extensive communication occurs between each block. 

According to Hooke's Law, the force exerted due to the
stretching of the springs is proportional to the distance
between the bodies connected to the spring. If the bodies are
allowed to move freely, they would move in the direction of
the force until the system achieved equilibrium. The same
idea is used for placing the cells. The final configuration of
the placement of cells is the one in which the system
achieves a solution that is closest to or in actual equilibrium.
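
As a minimal sketch of the equilibrium idea, assume each connection acts as a spring whose weight stands in for the number of "springs" between two blocks. The net force on a block is then the weighted sum of the displacement vectors to its neighbors, and it vanishes at the weighted average of the neighbor positions. The function names and data layout below are illustrative assumptions, not part of the described method.

```python
def net_force(pos, neighbors):
    """Sum of spring forces on a block at `pos`.
    `neighbors` is a list of ((x, y), weight) pairs."""
    fx = sum(w * (px - pos[0]) for (px, _), w in neighbors)
    fy = sum(w * (py - pos[1]) for (_, py), w in neighbors)
    return (fx, fy)

def equilibrium_position(neighbors):
    """Zero-force position: the weighted centroid of the neighbors."""
    total = sum(w for _, w in neighbors)
    x = sum(px * w for (px, _), w in neighbors) / total
    y = sum(py * w for (_, py), w in neighbors) / total
    return (x, y)

# A block tied to (0, 0) by one spring and to (4, 0) by three springs
# settles three quarters of the way toward the more strongly connected block.
neighbors = [((0, 0), 1), ((4, 0), 3)]
target = equilibrium_position(neighbors)
```

In a real placer the zero-force point must still be snapped to a legal, unoccupied cell location, which this sketch does not attempt.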

3. Integrated Circuit Cell Placement Representation 

Using physical design algorithms as discussed above,
each cell placement is conventionally represented in the
form of a list or table including locations on the chip and
identifiers of the cells that are assigned to the respective
locations. As indicated at 30 in FIG. 2, an exemplary and 
greatly simplified cell placement includes nine cell locations 
that are designated as (1) to (9), and cells that are indicated 
by identifiers 1 to 9. The locations are numbered in con- 
secutive order from left to right and top to bottom. 

The cell locations are designated by numbers in 
parentheses, whereas the cell identifiers are designated only
as numbers. Although only nine cell locations are illustrated 
as constituting the placement 30, it will be understood that 
an actual integrated circuit chip can include hundreds of 
thousands, millions or more of cell locations. 

The cells in the placement 30 can be represented by a 
table or list as indicated at 32. The list 32 is comparable to 
a chromosome in biological genetics, whereas each entry in 
the list 32 is analogous to a gene. In a more general sense, 
the entries in the list can be considered as abstract entities, 
whereas the list can be considered as a permutation of the 
entities. 

In genetic mutation, a new placement is produced from an 
initial placement by transposing individual cells. Genetic 
inversion involves reversing the order of a group of con- 
secutive cells. These operations can be performed using the 
conventional placement representation illustrated in FIG. 2 
without problems. However, attempting to perform genetic 
crossover using the conventional representation will result in 
duplication and/or omission of cells, and other illegal place- 
ments. 
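
A short sketch shows why mutation and inversion are safe on the conventional list representation: both only rearrange existing entries, so the result is still a permutation of the same cells. The function names and zero-based indices are illustrative assumptions.

```python
def mutate(placement, i, j):
    """Pairwise interchange: swap the cells at list positions i and j."""
    child = list(placement)
    child[i], child[j] = child[j], child[i]
    return child

def invert(placement, start, end):
    """Reverse the order of the consecutive cells in placement[start:end]."""
    child = list(placement)
    child[start:end] = reversed(child[start:end])
    return child

parent = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mutated = mutate(parent, 0, 8)    # cells 1 and 9 trade locations
inverted = invert(parent, 2, 6)   # cells 3..6 appear in reverse order
```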

The reason that the conventional placement representa- 
tion is not applicable to straight genetic crossover is illus- 
trated in FIG. 2. In the illustrated example, a second place- 
ment 34 is provided as represented by a list 36. The 
placements 30 and 34, which are referred to as "parents", are 
genetically crossed with each other to produce two new 
placements 38 and 40 that are represented by lists 42 and 44 
respectively. The new placements 38 and 40 are referred to 
as "offspring" or "children". 

The placement 30 consists of cells 1 to 9 in locations (1) 
to (9) respectively. The placement 34 consists of cells 4 to 
9 and 1 to 3 in locations (1) to (9) respectively. It will be 
understood that the particular numerical arrangement of 
cells in the placements 30 and 34 is arbitrary, and that the 
principles involved could be alternatively illustrated and 
described using any numerical arrangement. 

In FIG. 2, genetic crossover is performed by transposing 
or "swapping" the last four elements in the lists 32 and 36. 
This produces the placement 38 as represented by the list 42 
which includes the first five elements in the list 32 and the 
last four elements in the list 36. The crossover further 
produces the placement 40 as represented by the list 44 
which includes the first five elements in the list 36 and the 
last four elements in the list 32. 

Both of the exemplary placements are illegal, in that they 
include duplications and omissions of cells. In the placement 
38, the cells 1, 2 and 3 are each represented twice, whereas 
the cells 6, 7 and 8 are omitted. In the placement 40, the cells 
6, 7 and 8 are each represented twice, whereas the cells 1, 
2 and 3 are omitted. It is clear that this method is inappli- 
cable to the physical design of integrated circuit chips 
because the circuits would be inoperative if cells were 
duplicated and/or omitted. 
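
The FIG. 2 example can be reproduced in a few lines. The sketch below applies a straight single-point crossover to the lists 32 and 36 with the cut after the fifth element, then reports which cells are duplicated and which are omitted. The helper names are hypothetical.

```python
from collections import Counter

def single_point_crossover(a, b, cut):
    """Swap the tails of two lists after position `cut`."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def audit(child, cells):
    """Return (duplicated cells, omitted cells) for a candidate placement."""
    counts = Counter(child)
    duplicated = sorted(c for c in cells if counts[c] > 1)
    omitted = sorted(c for c in cells if counts[c] == 0)
    return duplicated, omitted

list_32 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
list_36 = [4, 5, 6, 7, 8, 9, 1, 2, 3]
list_42, list_44 = single_point_crossover(list_32, list_36, 5)

print(list_42, audit(list_42, list_32))
print(list_44, audit(list_44, list_32))
```

Neither child is a permutation of the nine cells, which is exactly the illegality described above.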

An expedient for bypassing this problem is described in
an article entitled "A GENETIC APPROACH TO STANDARD
CELL PLACEMENT USING META-GENETIC
PARAMETER OPTIMIZATION", by Khushro Shahookar
et al, in IEEE Transactions on Computer-Aided Design, Vol.
9, No. 5, May 1990, pp. 500-511. Shahookar accomplishes
this goal by utilizing a complicated modification of genetic
crossover referred to in the article as "cycle crossover".
Other modified crossover operations which are discussed by
Shahookar are referred to as "order crossover" and "partially
mapped crossover" (PMX).

The design of an integrated circuit chip requires the 
placement and routing of at least thousands of cells. The 
additional computing time required for the implementation 
of Shahookar's methods increases the total computer time
for a typical integrated circuit design to such an inordinate 
value that it would be impractical to implement in a com- 
mercial production environment. 

4. Cost Function Computation for IC Physical
Design

FIGS. 3 and 4 illustrate a "half-perimeter" wire length
computation method which is known in a basic form in the
art per se. This method is described in the above referenced
article to Sechen, and is advantageous in that it can be
performed quickly in a non-computationally intensive manner.

In FIG. 3, a cell placement 46 includes a plurality of cells 
48 that are allocated to respective locations on a surface 50 
representing an integrated circuit chip. A netlist for the 
placement includes a list of nets, each of which interconnects
terminals on cells that are to be electrically equivalent.
An exemplary net 52 is illustrated in the drawing as inter- 
connecting terminals 54, 56 and 58 of cells 48a, 48b and 48c 
respectively. 

The wirelength of the net 52 is estimated by defining or
constructing a rectangular "bounding box" 60 that surrounds
the outermost terminals of the net 52 and is spaced outwardly
therefrom in the horizontal and vertical directions by
a "detour factor" 5 that allows for variations in the actual
interconnect routing. The wirelength of the net 52 is computed
or approximated as the half-perimeter, or the sum of
the width and height of the bounding box 60.
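
A minimal sketch of the half-perimeter estimate follows. It assumes terminals are given as (x, y) coordinates and that the detour factor is a single distance added on every side of the bounding box. The function name, the coordinates and the detour value are illustrative assumptions.

```python
def half_perimeter(terminals, detour=0.0):
    """Estimate net wirelength as width + height of the bounding box
    around the terminals, with the box pushed outward by `detour`
    on each side."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    width = (max(xs) - min(xs)) + 2 * detour
    height = (max(ys) - min(ys)) + 2 * detour
    return width + height

# Three terminals forming an L-shaped net, in the spirit of FIG. 3.
net = [(0, 0), (4, 0), (4, 3)]
estimate = half_perimeter(net)
padded = half_perimeter(net, detour=0.5)
```

The cost is one pass over the terminals of each net, which is why the method is fast compared with actually routing the net.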

In the example of FIG. 3, the net 52 includes a horizontal 
leg between the terminals 54 and 56 that is approximately 
equal to the width of the bounding box 60, and a vertical leg 
between the terminals 56 and 58 that is approximately equal
to the height of the bounding box 60. Thus, the half- 
perimeter method provides a good approximation of the 
wirelength of the net 52. 

However, this is not always the case. For example, as
illustrated in FIG. 4, a placement 64 includes a plurality of
cells 66 on a surface 68. A net 70 interconnects terminals 72,
74, 76, 78, 80, 82, and 84 of cells 66a, 66b, 66c, 66d, 66e
and 66f. The net 70 is enclosed by a bounding box 86.

The net 70 includes a lower horizontal leg and a vertical
leg that extends between the terminals 72 and 84. The
lengths of these legs in combination are approximately equal
to the half-perimeter of the bounding box 86. However, the
net 70 further includes a plurality of vertical legs extending
from the lower horizontal leg to the terminals 74, 76, 78, 80,
82 and 84.

The lengths of these vertical legs, in combination with the
lengths of the legs extending between the terminals 72 and
84, substantially exceed the half-perimeter of the bounding
box 86. In this case, the half-perimeter estimation would
produce a computed value of wirelength for the net 70 that
is unrealistically low, and indicates a lower value of congestion
than would actually be present.
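
The gap can be illustrated with made-up coordinates. The sketch below compares the half-perimeter estimate with the length of one particular route, a horizontal trunk plus a vertical branch to each terminal, loosely modeled on the net 70. The terminal positions and the trunk-and-branch routing style are assumptions chosen for illustration; the route is not claimed to be the shortest possible.

```python
def half_perimeter(terminals):
    """Width plus height of the bounding box around the terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def trunk_and_branch_length(terminals, trunk_y):
    """Length of a route made of one horizontal trunk at height
    `trunk_y` spanning all terminals, plus a vertical branch from
    the trunk to every terminal."""
    xs = [x for x, _ in terminals]
    trunk = max(xs) - min(xs)
    branches = sum(abs(y - trunk_y) for _, y in terminals)
    return trunk + branches

# Seven terminals alternating between the bottom and top of the box.
net = [(0, 0), (1, 4), (2, 0), (3, 4), (4, 0), (5, 4), (6, 0)]
estimate = half_perimeter(net)
routed = trunk_and_branch_length(net, trunk_y=0)
```

Here the half-perimeter estimate is 10 while the trunk-and-branch route is 18 units long, so a cost function built on the estimate alone would understate the wiring in that region.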

5. Parallel Processing Applied to IC Physical
Design

A major factor that prevents conventional algorithms from
being utilized for the placement and routing of larger numbers
of cells is that the physical design problem is executed
using serial or uniprocessor computers. Numerous iterations
of the placement and general and detailed routing algorithms 
are necessary before the solution converges to an optimal 
design. Execution of these iterations is extremely time 
consuming, requiring days, or even weeks or months to 
produce a design for a large integrated circuit. 

In addition, human intervention is required for all but the 
simplest designs. Since each stage of iteration inherits the 
results, but not the details, of the previous operational stage, 
no sharing of information between stages, such as placement 
and global routing, that could result in faster convergence, is 
inherent in the process. Feedback of routing information, for 
example, could speed up convergence of the placement 
operation. Since this does not occur, a large number of 
non-optimal solutions are generated, and a human technician 
is required to obtain an overview of the process and divert 
it away from false and/or inefficient solutions. 

An implementation in which the genetic algorithm is 
executed in parallel on separate computers is described in an 
article entitled "WOLVERINES: STANDARD CELL 
PLACEMENT ON A NETWORK OF WORKSTATIONS", 
by S. Mohan et al, IEEE Transactions on Computer-Aided 
Design of Integrated Circuits and Systems, Vol. 12, No. 9, 
September 1993, pp. 1312-1326. The procedure runs a basic 
genetic algorithm on each of a plurality of computer-aided-
design (CAD) workstations in the network and utilizes an
additional genetic operator, migration, which transfers 
placement information from one workstation to another 
across the network. Migration transfers genetic material 
from one environment to another, thereby introducing new 
genetic information and modifying the new environment. 

If the migrants are fitter than the existing individuals in 
the new environment, they get a high probability of repro- 
duction and their genetic material is incorporated into the 
local population. When the population is very small it tends 
to converge after a few generations, in the sense that all the 
individuals come to resemble one another. Migration pre- 
vents this premature convergence of inbreeding by intro- 
ducing new genetic material. In this manner, the genetic 
algorithm is modified by splitting the large population over 
different workstations and using the migration mechanism to 
prevent premature convergence. 
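
The migration operator can be sketched as follows. The cited article is not reproduced here, so the details are assumptions made for illustration: the sub-populations ("islands") are arranged in a ring, each island sends copies of its lowest-cost placements to the next island, and incoming migrants displace the receiver's highest-cost placements so that every population keeps its size.

```python
def migrate(islands, cost, n_migrants=1):
    """Return new island populations after one migration step.
    Island i receives the best `n_migrants` of island i - 1 (ring)."""
    best = [sorted(pop, key=cost)[:n_migrants] for pop in islands]
    result = []
    for i, pop in enumerate(islands):
        incoming = best[i - 1]                 # index -1 wraps to the last island
        keep = len(pop) - len(incoming)
        survivors = sorted(pop, key=cost)[:keep]
        result.append(survivors + incoming)
    return result

# Toy populations where each "placement" is represented by its cost.
islands = [[5, 9, 7], [1, 8, 6], [4, 3, 2]]
after = migrate(islands, cost=lambda p: p)
```

Each island continues its own genetic algorithm between migration steps, so the migrants only matter if they are fit enough to be selected for reproduction locally.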

Although Mohan discloses the general concept of parallel 
processing of genetic algorithms, he teaches a procedure in 
which the various stages of integrated circuit chip design are 
performed in series, with no feedback or sharing of infor- 
mation between stages until an entire design is completed or 
at least the global routing stage is completed. 

6. Distributed Shared Memory (DSM) Parallel 
Processing Architectures 

a. Overview 

An architecture including a plurality, preferably many,
parallel processors that is especially suited for application to
physical design automation of integrated circuits is known
as cache coherent Distributed Shared Memory (DSM). Two
examples of this architecture are presented in an article
entitled "The Stanford Dash Multiprocessor", by Daniel
Lenoski et al, in Computer Magazine, March 1992, pp.
63-79, and in a technical summary of the KSR1 System
prepared by Kendall Square Research, of Waltham, Mass.,
1992.

A basic DSM architecture of the type described in the 
article to Lenoski (the DASH system) is illustrated in FIG. 
5. A DSM system 90 includes processors 92, 94, and 96, 98 
that are arranged in two clusters 100 and 102 respectively.
Cache memories 104, 106, 108 and 110 are connected to the
processors 92, 94, 96 and 98 respectively. The cluster 100
further includes a shared memory 112 and a directory 114,
whereas the cluster 102 further includes a shared memory
116 and a directory 118. The clusters 100 and 102 communicate
with each other via an interconnection network 120.

Although only four processors are illustrated in FIG. 38,
in a practical application the number of processors will
preferably be tens, hundreds or even thousands. The caches
104 and 106, shared memory 112 and directory 114 are
interconnected by a snooping bus 122, whereas the caches
108 and 110, shared memory 116 and directory 118 are
interconnected by a snooping bus 124.

The arrangement of FIG. 5 is advantageous in that all of
the memory in the system, consisting of the caches 104, 106,
108 and 110 and shared memory 112 and 116, is available for
use by all of the processors 92, 94, 96 and 98, and the
memory is scalable. The memory used by each processor
can be dynamically allocated depending on the requirements
of a particular task.

However, the memory access times are different depending
on the type of access. The processors can access the
caches that are directly connected thereto at a highest speed,
and access the shared memory in their respective cluster at
a lower speed. A processor in one cluster can access a cache
or shared memory in another cluster via the interconnection
network 120, but at a yet lower speed.

The snooping buses 122 and 124 provide cache coherence
within the clusters 100 and 102 respectively, whereas the
directories 114 and 118 provide cache coherence for the
entire system 90. In the cache coherence scheme, multiple
copies of a particular data block can exist in the different
memories of the system. The directories 114 and 118 keep
track of which data blocks are stored in which memories.

If a data block is altered by any of the processors, the
unmodified copies in other memories are either invalidated
or updated. If invalidation is used, the relevant directory 114
or 118 sends messages only to the memories that contain the
unmodified copies to indicate that the copies are no longer
valid. Where updating is used, copies of the modified block
are sent to the memories in which the original copies were
stored.

b. Limitations of Basic DSM Architecture

In applying genetic algorithms and other fitness improvement
operations to solving integrated circuit cell placement
and other optimization problems, an important issue is that
the computational requirements increase very rapidly with
problem size. The size of the "DNA" or data structure
representing a member of the population or placement
increases with the problem size. The size of the population
required to find the optimum placement also increases with
the problem size, so the memory requirements increase very
rapidly.

The time required to perform a fitness calculation
increases with the size of the DNA, and the number of fitness
calculations required per generation increases with the size
of the population. The number of generations required to
reach a solution increases with the size of the population.

Thus, the computation time increases rapidly with problem
size. Taking the memory requirements and computation
time together, the computational requirements increase very
rapidly with problem size. For example, using a genetic
algorithm to find an optimal placement of 9 cells takes a few
seconds, 25 cells takes a few minutes, and 100 cells takes a
few hours, using an industry Standard Performance Evaluation
Criteria (SPEC) 50 workstation. Using this approach to
find an optimal placement of a state-of-the-art chip with
100,000 or more cells is not feasible.

In a DSM system such as described above with reference
to FIG. 5, a shared or main memory is provided for data that
is more global, operated on by more than one processor, or
is too large to be stored in a local cache memory. A scalable
mechanism, typically a directory structure, is provided to
maintain the main memory and all of the cache memories
coherent with each other.

The directory logic enables any processor to access data
in the main memory or any cache memory, and invalidates
or updates any obsolete copies of data. The directory based
DSM architecture is especially advantageous in that the
memory bandwidth scales with the number of processors.

In view of the numerous advantages provided by the DSM
architecture, it would be desirable to integrate a DSM node
on a single integrated circuit chip. However, the inherent
characteristics of the conventional DSM design frustrate the
accomplishment of this goal using presently available
microelectronic circuit fabrication technology.

More specifically, it is highly preferable to store data in a
local cache memory, which is generally implemented as
Static Random Access Memory (SRAM) rather than in a
main memory, which is generally implemented as Dynamic
Random Access Memory (DRAM) due to the much lower
latency and access time. However, if a cache memory is not
large enough, some of the data that is required to be stored
must be directed to the main memory. This data is said to
"miss" the cache memory, and the number of memory access
operations that must be performed using the main memory
is referred to as the "cache miss rate".

Since the latency of the main memory is much higher than
that of the cache memory, a large cache memory is required
to provide an acceptably low cache miss rate. The time
required to process a cache miss, which is referred to as the
"cache miss resolution period" or "cache miss cost",
includes the time required to access the main memory in
addition to performing requisite housekeeping functions.

The processor that ordered the memory access operation
which resulted in the cache miss is "stalled" during the cache
resolution period, and cannot execute any other instructions
until the memory access operation is completed.

Assuming a 100 MHz clock rate, a cache memory access
operation can be typically performed in 10 ns, whereas a
typical cache miss resolution period or cost is on the order
of 200 to 500 ns. If the cache miss rate is high and the
instructions being processed are memory intensive, the
processing speed can be reduced to such an extent that the
system can operate at an effective clock rate of as low as 2
MHz.

For this reason, the cache memory in a conventional DSM
system is made sufficiently large to reduce the cache miss
rate to a level at which the processing speed is not unacceptably
degraded. However, a cache memory of conventional
size is too large to fit on a single integrated circuit chip
together with a processor, main memory and the requisite
logic and control circuitry.

The problem is exacerbated by the fact that cache memory
is conventionally implemented as SRAM, whereas main
memory is implemented as DRAM. SRAM has a much
lower gate or cell density than DRAM. For example, assuming
a CMOS process with a feature size of 0.5 μm, the
SRAM density is typically 2 kilobytes per square millimeter,
whereas the DRAM density is 32 kilobytes per square
millimeter.

The high latency and cache miss cost for main memory
access in a conventional multi-chip DSM system, even if a
large cache memory is provided to reduce the cache miss
rate, reduce the effective processing speed to such an extent
that complicated processors are required to increase the
processing speed to an acceptable value.

An example of such a processor is a "superscalar" processor
that executes several instructions simultaneously
using an asynchronous pipelining system. In addition to
being complicated and expensive, such processors are too
large to fit on a single integrated circuit chip together with
the other elements of a DSM node.

c. Telecommunications Network Applications

Electronic data networks are becoming increasingly widespread
for the communication of divergent types of data
including computer coded text and graphics, voice and
video. Such networks enable the interconnection of large
numbers of computer workstations, telephone and television
systems, video teleconferencing systems and other facilities
over common data links or carriers.

Computer workstations are typically interconnected by
local area networks (LAN) such as Ethernet, Token Ring,
DECNet and RS-232, whereas metropolitan, national and
international systems are interconnected by wide area networks
(WAN) such as T1, V.35 and FDDI.

Although effective, communication using these networks
is relatively slow, and a complicated and expensive network
interface adapter must be provided for each device that is to
be connected to a network.

SUMMARY OF THE INVENTION

CONTENTS

1. Generalized Optimization Processing Using Decomposition
and Simultaneous Processing
2. Optimization Processing for Integrated Circuit (IC)
Physical Design Automation
3. Hierarchical Execution by Asynchronous Delegation
(HEADWARE)
4. Integrated Circuit Cell Placement Representation
5. Congestion Based Cost Function Computation
6. Improved Genetic Algorithms for Physical Design
Automation
7. Optimal Switching of Algorithms
8. Optimal Switching of Cost Functions
9. Simultaneous Placement and Routing (SPAR)
10. Moving Windows
11. Chaotic Placement
12. Single Chip Distributed Shared Memory Node
13. Single Chip Communications Node

1. Generalized Optimization Processing Using
Decomposition and Simultaneous Processing

The present invention provides a method of process
decomposition and optimization utilizing massively parallel
simultaneous processors that is especially suited to integrated
circuit cell placement optimization.

The present method is not limited to any specific
application, however, and can be advantageously applied to
optimization problems in a number of diverse areas such as
logic synthesis, circuit optimization (for minimum power,
etc.), software optimization, logistical problems such as
traffic control and routing.

In general, the present method can be utilized to obtain
solutions to optimization problems having many simple or
complex variables that are interrelated. For example, further
applications of the invention include financial market and
investment analysis, currency arbitrage, weather
forecasting, seismic and nuclear analysis and maintenance
of complex databases.

In each application of the present method for producing
an optimized solution to a problem, a methodology for
solving the problem and/or data representing the problem
are decomposed into a plurality of tasks that are performed
simultaneously to produce a result for each task. The results
are then recomposed to produce an optimized solution to the
problem.

The optimized solution is analyzed to produce an
evaluation, and the steps of performing the tasks, recomposing
the results and analyzing the optimized solution to
produce the evaluation are repeated to further optimize the
optimized solution if the evaluation does not satisfy a
predetermined criterion.

2. Optimization Processing for Integrated Circuit
(IC) Physical Design Automation

In a physical design automation system for producing an
optimized cell placement for an integrated circuit chip, a
placement optimization methodology is decomposed into a
plurality of cell placement optimization processes that are
performed simultaneously by parallel processors on input
data representing the chip.

The results of the optimization processes are recomposed
to produce an optimized cell placement. The fitness of the
optimized cell placement is analyzed, and the parallel processors
are controlled to selectively repeat performing the
optimization processes for further optimizing the optimized
cell placement if the fitness does not satisfy a predetermined
criterion.

The system can be applied to initial placement, routing,
placement improvement and other problems.

The processors can perform the same optimization process
on different placements, or on areas of a single placement.
Alternatively, the processors can perform different
optimization processes simultaneously on a single initial
placement, with the resulting processed placement having
the highest fitness being selected as the optimized placement.

The processors can further selectively reprocess areas of
a placement having high cell interconnect congestion or
other low fitness parameters.

3. Hierarchical Execution by Asynchronous
Delegation (HEADWARE)

In accordance with a massively parallel simultaneous
processing methodology of the present invention, a master
or host process, referred to as a "team leader" in
the present "HEADWARE" concept, is first started. The
team leader assigns tasks to worker processes and collects
results. The present method uses very little computer time
and can service a large number of worker processes.

When a worker process is started, the first thing it does is
to send a message to the team leader requesting a task. The
team leader then replies with a message assigning a task and
marks the task as having been assigned. Communication
between the team leader and the worker then ceases, leaving
the team leader free to communicate with other workers.

It is not necessary for the team leader to record which
worker was assigned a particular task, or when the task was
assigned. An arbitrary number of workers can request tasks
in this manner, with the team leader assigning each worker
a previously unassigned task.
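
The task assignment side of the team leader can be sketched in a few lines. This is a single-process illustration with hypothetical names, not the described implementation: message passing between processes is replaced by a plain method call, and the only state kept is a per-task status, since the leader does not need to record which worker holds which task.

```python
class TeamLeader:
    """Hands out each task at most once and remembers only task status."""

    def __init__(self, tasks):
        # No worker identity or timestamp is stored, only the task state.
        self.status = {task: "unassigned" for task in tasks}

    def request_task(self):
        """Handle one worker request: return a previously unassigned
        task and mark it assigned, or None if nothing is left."""
        for task, state in self.status.items():
            if state == "unassigned":
                self.status[task] = "assigned"
                return task
        return None

leader = TeamLeader(["place region A", "place region B"])
first = leader.request_task()
second = leader.request_task()
third = leader.request_task()   # no unassigned tasks remain
```

Because each request is answered from this small table and the exchange then ends, the leader spends very little time per worker and can serve many of them.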
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When a worker completes a task, it resumes communi- Cells having lowest fitness are selected for mutation, and 

cation with the team leader and identifies the task that it was transposed to random locations, to adjacent locations, with 

assigned, and the results that were obtained from performing cells having second worst fitness, to the center of mass of the 

the task. The team leader then records the results, marks the respective interconnect nets, or with two or more cells in a 

task as having been completed and assigns the worker 5 cyclical manner, 
another task. The team leader further preferably saves a copy 

of the task list on a computer disk or the like at periodic 7. Optimal Switching of Algorithms 

intervals as a precaution against failure of the team leader ^ o of mQre fitQess improvement algoritDms are 

process. available, and are optimally switched from one to the other 

4. Integrated Circuit Cell Placement Representation m accordance with an optimization criterion to maximize 

convergence of the placements toward the optimal configu- 

A large number of possible cell placements for an inte- ra tion. 
grated circuit chip are evaluated to determine which has the 

highest fitness in accordance with a predetermined criteria Optimal Switching of Cost Functions 

such as interconnect congestion. Each cell placement, which 15 

constitutes an individual permutation of cells from a popu- Two or more fitness (cost) calculation functions are 

lation of possible permutations, is represented as an initial available, and are optimally switched from one to the other 

cell placement in combination with a list of individual cell i° accordance with a optimization criterion, 
transpositions or swaps by which the cell placement can be 

derived from the initial cell placement. 20 9. Simultaneous Placement and Routing (SPAR) 

A cell placement can be genetically mutated and/or a method for optimizing a cell placement for an inte- 
inverted by adding swaps to the list for its cell placement grated circuit chip includes decomposing an initial place- 
which designates cells to be transposed. Genetic crossover me nt of cells into a hierarchial order of groups of cells. The 
can be performed by transposing swaps between the lists for groups are routed simultaneously using parallel processors, 
two cell placements. 25 and the results are recomposed to provide a global routing 
The present cell representation and transposition method that provides a detailed mapping of cell interconnect con- 
enables any type of cell transposition to be performed gestion in the placement. 

without loss or duplication of cells or generation of illegal
placements.

5. Congestion Based Cost Function Computation

The fitness of each integrated circuit cell placement is evaluated by
dividing the placement into rectangular areas we call switch boxes
that surround the cell locations respectively. A bounding box is
constructed around each net of a netlist for the placement. A
congestion factor is computed for each switch box, for example, as
being equal to the number of bounding boxes that overlap the
respective switch box.

A cost factor for the placement and associated netlist, which is an
inverse measure of the fitness, is computed as the maximum value,
average value, sum of squares or other function of the congestion
factors.

The individual congestion factor computations can be modified to
require that a terminal of a net of one of the bounding boxes overlap
or be within a predetermined distance of a switch box in order for
the congestion factor to be computed as the sum of the overlapping
bounding boxes in order to localize and increase the accuracy of the
cost factor estimation. The congestion factor for a switch box can
also be weighted in accordance with the proximity of the switch box
to a terminal.

6. Improved Genetic Algorithms for Physical Design Automation

Cells for transposition or "swapping" within each placement using
genetic algorithms are selected using, for example, greedy algorithms
based on the fitness of each cell. The cell fitnesses are evaluated
in terms of interconnect congestion, total net wire length or other
criteria.

Cells are selected for genetic crossover by sorting the cells in
order of fitness and multiplying the cell fitnesses by weighting
factors that increase non-linearly with rank. The cells are selected
using linear or random or pseudo-random or patterned number
generation such that cells with higher fitnesses have a higher
probability of selection.

Areas of high congestion are identified, and a congestion reduction
algorithm is applied using the parallel processors to alter the
placement in these areas simultaneously. The overall fitness of the
placement is then computed, and if it has not attained a
predetermined value, the steps of identifying congested areas and
applying the congestion reduction algorithm to these areas are
repeated.

The present invention advantageously utilizes detailed congestion
information provided by the global routing. However, global routing
is very time consuming, and impractical to perform after each local
congestion reduction iteration within the limits of current
microelectronic circuit technology.

The present invention avoids this problem by estimating the error
created by altering the placement without repeating the routing, and
repeating the global routing only if the error exceeds a
predetermined value. This enables a number of improvement operations
to be performed and their results evaluated before another global
routing is required, thereby greatly speeding up the optimization
process.

The present methodology, in combination with simultaneous processing
applied to routing and fitness improvement and immediate feedback of
improvement results to the congestion reduction processing, reduces
the time required for placement optimization to a level that can be
advantageously realized in a practical implementation.

10. Moving Windows

One or more non-overlapping moving windows are positioned over a
placement of cells for an integrated circuit chip to delineate
respective subsets of cells. A fitness improvement operation such as
simulated evolution is performed on the subsets simultaneously using
parallel processors.

The windows may be either moved to specifically identified high
interconnect congestion areas of the placement, or are moved across
the placement in a raster type or other organized or random pattern
such that each area of the
placement is processed at least once. Exchange of misplaced
cells between subsets can be accomplished by dimensioning 
the windows and designing the window movement pattern 
such that the subsets overlap. Alternatively, such exchange 
can be accomplished by using two sets of windows of
different sizes. 

As yet another alternative, the improvement operation can 
allow misplaced cells to be moved to a border area outside 
a window. Each misplaced cell is placed on a list, and then 
moved to the centroid of the group of cells to which it is
connected, which can be outside the subset that originally 
included the misplaced cell. 

Dividing the chip into "moving windows" and optimizing 
the placement within each window reduces the time required 
to find a solution. It has two major advantages. By applying
a genetic algorithm or other fitness improvement operation 
only to cells within the window, the size of the problem is 
much smaller, and the computational requirements are dra- 
matically reduced. Also, each window can be assigned to a 
different processor of a suitable multiprocessor computer, so
the optimization of the windows can be done simultaneously 
in parallel, reducing the wall-clock time required to find the 
solution. 

11. Chaotic Placement

In a "chaotic" placement method of the present invention, 
the fitness of a cell placement for an integrated circuit chip 
is optimized by relocating at least some of the cells to new 
locations that provide lower interconnect congestion. For
each cell, the centroid of the group of cells to which the cell 
is connected is computed. The cell is then moved toward the 
centroid by a distance that is equal to the distance from the 
current position of the cell to the centroid multiplied by a
"chaos" factor λ.

The value of λ is selected such that the cell relocation
operations will cause the placement to converge toward an 
optimal configuration without chaotic diversion, but with a 
sufficiently high chaotic element to prevent the optimization 
operation from becoming stuck at local fitness optima.
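
The relocation step described above can be written as a short worked example. This is a minimal sketch, assuming two-dimensional cell coordinates; the function names `centroid` and `chaotic_move` and the parameter name `chaos_factor` are illustrative and do not come from the text.

```python
def centroid(points):
    """Centroid of a non-empty list of (x, y) positions."""
    n = len(points)
    return (sum(x for x, _ in points) / n,
            sum(y for _, y in points) / n)


def chaotic_move(cell_pos, connected_positions, chaos_factor):
    """Move a cell toward the centroid of the cells it is connected to.

    The displacement equals the vector from the cell to the centroid
    scaled by chaos_factor, so a factor of 1 lands exactly on the
    centroid and a factor above 1 overshoots it.
    """
    cx, cy = centroid(connected_positions)
    x, y = cell_pos
    return (x + chaos_factor * (cx - x),
            y + chaos_factor * (cy - y))
```

For example, a cell at (0, 0) connected to cells at (4, 0) and (0, 4) has a centroid at (2, 2); with a chaos factor of 0.5 the cell moves to (1, 1).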

The new cell locations can be modified to include the 
effects of cells in other locations, such as by incorporating 
a function of cell density gradient or force direction into the 
computation. This spreads out clumps of cells so that the 
density of cells is more uniform throughout the placement.
The attraction between cells in the nets is balanced against 
repulsion caused by a high local cell density, providing an 
optimized tradeoff of wire length, feasibility and congestion. 

12. Single Chip Distributed Shared Memory Node

The present invention overcomes the problems discussed 
above regarding conventional multi-chip Distributed Shared 
Memory (DSM) systems, and provides a complete DSM 
node that is integrated on a single integrated circuit chip.

In accordance with the invention, the capacity of a cache 
memory is substantially reduced over that required for a 
multi-chip DSM implementation to enable the cache 
memory, a main memory, a processor and requisite logic and 
control circuitry to fit on a single integrated circuit chip.

The increased cache miss rate created by the reduced 
cache memory capacity is compensated for by the reduced 
cache miss resolution period or cost resulting from integrat- 
ing the main memory and processor on the single chip. The 
reduced cache miss resolution period enables the processor
clock rate to be substantially increased, so that a processor 
having a simple functionality such as a reduced instruction 
set computer (RISC) processor can be utilized and still
provide the required processing speed. 

The RISC processor is substantially smaller than a more 
complicated processor that would be required to provide the 
same processing speed in a multi-chip DSM 
implementation, thereby enabling the RISC processor to fit 
on the chip with the other elements. 

The smaller and less expensive RISC processor increases 
the number of processors that can be connected to a main 
memory of predetermined size. This increases the number of 
processors that can simultaneously operate on a problem 
defined by the main memory space and thereby increases the 
computational efficiency, and also reduces the amount of 
main memory that is required for each processor. This 
further enhances the ability of the present DSM node to be 
implemented on a single integrated circuit chip. 

13. Single Chip Communications Node 

The present invention provides a single-chip communi- 
cations node that can be used in telecommunications net- 
works other than DSM, and is faster in operation, simpler in 
construction and less expensive to manufacture and imple- 
ment than conventional network interfaces. 

The present communications node includes a memory 
controller for providing local and remote memory 
coherency, and a bidirectional interconnect unit that converts
memory access instructions into memory access messages and
vice versa.

The above and other objects, features and advantages of 
the present invention will become apparent to those skilled 
in the art from the following detailed description taken with 
the accompanying drawings, in which like reference numer- 
als refer to like parts. 

DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a simplified diagram illustrating an integrated 
circuit chip which can be optimally designed in accordance 
with the present invention; 

FIG. 2 is a diagram illustrating a basic genetic crossover 
operation and the drawbacks thereof; 

FIGS. 3 and 4 are diagrams illustrating a form of cost 
factor estimation method; 

FIG. 5 is a diagram illustrating a Distributed Shared 
Memory (DSM) parallel processing architecture of the 
present invention; 

FIG. 6 is a flowchart illustrating an optimization process 
decomposition and parallel processing method of the present 
invention; 

FIG. 7 is a functional diagram that further illustrates the 
method of FIG. 6; 

FIG. 8 is a diagram illustrating the main blocks of a 
multi-processing optimization system of the present inven- 
tion that operates in accordance with the method of FIGS. 6 
and 7; 

FIG. 9 is a block diagram illustrating a DSM architecture 
including different types of processors for practicing the 
invention; 

FIG. 10 is a block diagram illustrating a fail-safe distrib- 
uted processing or HEADWARE method for practicing the 
present invention; 

FIG. 11 is a diagram illustrating a list of tasks in the 
process of being performed using the method of FIG. 10; 

FIG. 12 is a flowchart illustrating the distributed process- 
ing method of FIGS. 10 and 11; 
FIG. 13 is a diagram illustrating a location/location swap
operation utilizing a cell placement and transposition system
of the present invention;

FIG. 14 is a diagram that similarly illustrates a cell/cell
swap operation;

FIG. 15 is a diagram illustrating a location/cell swap
operation;

FIG. 16 is a diagram illustrating a cell/location swap
operation;

FIG. 17 is a diagram illustrating a row swap operation;

FIG. 18 is a diagram illustrating a column swap operation;

FIG. 19 is a diagram illustrating a roll up operation;

FIG. 20 is a diagram illustrating a roll right operation;

FIG. 21 is a diagram illustrating a move block operation;

FIG. 22 is a diagram illustrating a rotate block clockwise
operation;

FIG. 23 is a diagram illustrating an inversion operation;

FIG. 24 is a diagram illustrating a genetic crossover
operation;

FIGS. 25 to 28 are diagrams illustrating a congestion
based cost factor estimation method of the present invention;

FIG. 29 is a flowchart illustrating the basic genetic
algorithm;

FIG. 30 is a graph illustrating the relative fitness of cells
in an exemplary placement when ranked in order of fitness;

FIG. 31 is a graph illustrating the relative fitness of cells
in the exemplary cell placement in accordance with a
statistical selection method of the invention;

FIG. 32 is a diagram illustrating random cell selection
utilizing the statistical selection method illustrated in FIG.
31;

FIG. 33 is a flowchart illustrating a uniform crossover
operation utilizing the statistical selection method of FIG.
31;

FIGS. 34a and 34b in combination constitute a listing of
a computer simulated cell placement operation utilizing
uniform crossover and the present statistical selection
method;

FIG. 35 is a table listing an optimal cell placement
produced by the simulation of FIGS. 34a and 34b;

FIG. 36 is a graph illustrating the performance of the
simulation of FIGS. 34a and 34b;

FIG. 37 is a diagram illustrating a greedy crossover
operation in accordance with the present invention;

FIGS. 38 to 43 are diagrams illustrating greedy mutation
operations of the invention;

FIG. 44 is a graph illustrating the characteristics of two
placement fitness optimization processes that can be
optimally switched in accordance with the present invention;

FIG. 45 is a graph illustrating optimal switching between
the two optimization processes shown in FIG. 44;

FIG. 46 is a graph illustrating the relationship between the
numbers of cell placements and their corresponding
congestion and wire length based cost function for an exemplary
population of cell placements;

FIG. 47 is a graph illustrating the relationship of FIG. 46
in the form of two separate curves;

FIG. 48 is a graph illustrating optimal switching from a
congestion based cost function to a wirelength based cost
function in accordance with the present invention;

FIGS. 49a to 49c in combination constitute a listing of a
computer simulated cell placement operation utilizing the
cost function switching method illustrated in FIG. 48;

FIG. 50 is a flowchart illustrating a method of simultaneous
placement and routing of the present invention;

FIG. 51 is a diagram illustrating a method of identifying
cells having high interconnect congestion;

FIG. 52 is a diagram illustrating a method of relocating a
cell having high interconnect congestion such that the
congestion is reduced;

FIG. 53 is a diagram illustrating another method of
relocating a cell to reduce interconnect congestion;

FIGS. 54 to 56 are diagrams illustrating an optimization
processing method of the invention using moving windows
in which misplaced cells are exchanged by overlap between
windows;

FIGS. 57 to 59 are diagrams illustrating another
optimization processing method using moving windows in which
misplaced cells are exchanged between two sets of windows
of different sizes;

FIGS. 60 and 61 are diagrams illustrating another
optimization processing method using moving windows in
which misplaced cells are exchanged by allowing misplaced
cells to move to a border area around a window and
subsequently be relocated to positions outside the window;

FIG. 62 is a diagram illustrating a method of optimally
relocating a cell in a placement using a chaotic optimization
method of the invention;

FIG. 63 is a diagram illustrating computation of a center
of gravity of a net for practicing the method of FIG. 62;

FIG. 64 is a diagram illustrating how the method of FIG.
62 can be modified to include the effects of a density
gradient;

FIG. 65 is a diagram illustrating how the method of FIG.
62 can be modified to include the effects of forces resulting
from other cells in the placement;

FIG. 66 is a vector diagram illustrating the method of FIG.
62;

FIG. 67 is a block diagram illustrating a single chip
integrated circuit Distributed Shared Memory (DSM) node
of the present invention;

FIG. 68 is a block diagram illustrating a computing unit
of the present DSM node;

FIG. 69 is a block diagram illustrating a memory
controller of the DSM node;

FIG. 70 is a block diagram illustrating an interconnect
interface of the DSM node; and

FIG. 71 is a block diagram illustrating a single chip
integrated circuit communications node of the present
invention.

DETAILED DESCRIPTION OF THE INVENTION

1. Generalized Optimization Processing Using Decomposition
and Simultaneous Processing

2. Optimization Processing for Integrated Circuit (IC)
Physical Design Automation

3. Hierarchical Execution by Asynchronous Delegation
(HEADWARE)

4. Integrated Circuit Cell Placement Representation

5. Congestion Based Cost Function Computation

6. Improved Genetic Algorithms for Physical Design
Automation

a. Basic Algorithms

b. Statistical Selection 

c. Greedy Crossover 

d. Greedy Mutation 

7. Optimal Switching of Algorithms

8. Optimal Switching of Cost Functions 

9. Simultaneous Placement and Routing (SPAR) 

10. Moving Windows 

11. Chaotic Placement 

12. Distributed Shared Memory Implementations 

a. Single Chip Processor Node 

b. Single Chip Communications Node 

1. Generalized Optimization Processing Using 
Decomposition and Simultaneous Processing

The present invention provides a method of process 
decomposition and optimization utilizing massively parallel 
simultaneous processors that is especially suited to inte- 
grated circuit cell placement optimization. This application 
will be described in detail in order to clearly present the
concepts of the invention. 

The present method is not limited to any specific 
application, however, and can be advantageously applied to 
optimization problems in a number of diverse areas such as 
logic synthesis, circuit optimization (for minimum power,
etc.), software optimization, and logistical problems such as
traffic control and routing.

In general, the present method can be utilized to obtain 
solutions to optimization problems having many simple or
complex variables that are interrelated. For example, further 
applications of the invention include financial market and 
investment analysis, stock and currency arbitrage, weather 
forecasting, seismic, nuclear and chemical analysis and 
maintenance of complex databases.

In each application of the present method for producing 
an optimized solution to a problem, a methodology for 
solving the problem and/or data representing the problem 
are decomposed into a plurality of tasks that are performed 
simultaneously and/or in parallel to produce a result for each
task. The results are then recomposed to produce an opti- 
mized solution to the problem. 

The optimized solution is analyzed to produce an 
evaluation, and the steps of performing the tasks, recomposing
the results and analyzing the optimized solution to
produce an evaluation are repeated to further optimize the 
optimized solution if the evaluation does not satisfy a 
predetermined criterion. 

For the purposes of the present invention, the word
"simultaneously" is defined as two or more tasks being
performed concurrently (at the same time). The word "par- 
allel" is defined as two or more tasks being performed 
independently. Since it is possible for some tasks to have to 
wait for results of other tasks that are being performed in 
parallel, it is within the scope of the invention to perform
tasks in parallel, but not necessarily simultaneously. In 
addition, some processors can be working on housekeeping 
tasks such as supervision, statistical analysis or memory 
management rather than working on a direct aspect of the 
main problem.

The present optimization process decomposition and par- 
allel processing method is illustrated in the form of a 
simplified flowchart in FIG. 6 and a functional diagram in 
FIG. 7, and comprises the following steps. 

(a) Input the problem to be solved, including the data
defining the problem, the algorithms, rules and other 
applicable constraints, and the objective to be achieved. 
(b) Decompose the optimization processing methodology
and/or data into a plurality of processes that can be 
performed simultaneously and/or in parallel. 

(c) Perform the processes using respective parallel 
processors, with one or more processors coordinating 
the operation of other processors. 

(d) Recompose the results of performing the processes to 
produce an optimized solution, and evaluate the solu- 
tion on the basis of a predetermined criterion. 

(e) Determine if the objective has been satisfied. If so, the 
process is completed. If not, the optimization processes 
are refocussed to further optimize the solution. More 
specifically, decomposition, optimization processing, 
recomposition, evaluation, and control of repeatedly 
performing selected optimization processes on selected 
areas of the problem to further optimize the solution are 
distributively applied using parallel processing. 
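
Steps (a) to (e) can be sketched as a simple loop. The following Python example is only an illustration under stated assumptions: the "problem" is a toy list of numbers, the per-task improvement (halving each value) and the cost (sum of squares) are placeholders, and every function name is hypothetical rather than taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, n_tasks):
    """Step (b): split the data into at most n_tasks contiguous chunks."""
    size = max(1, -(-len(data) // n_tasks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def improve(chunk):
    """Step (c): placeholder per-task optimization (halves each value)."""
    return [v / 2 for v in chunk]


def recompose(results):
    """Step (d), first half: join the per-task results back together."""
    return [v for chunk in results for v in chunk]


def evaluate(solution):
    """Step (d), second half: placeholder cost, lower is better."""
    return sum(v * v for v in solution)


def optimize(data, n_tasks, target_cost, max_rounds=100):
    """Repeat decompose / perform / recompose / evaluate until step (e)
    finds that the objective (cost <= target_cost) has been satisfied."""
    solution = list(data)
    for _ in range(max_rounds):
        tasks = decompose(solution, n_tasks)
        with ThreadPoolExecutor(max_workers=n_tasks) as pool:
            results = list(pool.map(improve, tasks))
        solution = recompose(results)
        if evaluate(solution) <= target_cost:
            break
    return solution
```

Calling `optimize([8.0, 4.0, 2.0, 1.0], n_tasks=2, target_cost=1.0)` runs four rounds before the cost drops below the target. In a real system the refocusing in step (e) would select which areas and which processes to repeat; this sketch simply repeats everything.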

2. Optimization Processing for Integrated Circuit 
(IC) Physical Design Automation

FIG. 8 illustrates an integrated circuit physical design 
automation system 130 that constitutes a specific application 
of the process decomposition and parallel processing 
method of the present invention as described above with 
reference to FIGS. 6 and 7. 

The system 130 receives inputs for a user specified 
integrated circuit design including a netlist, a library of 
standardized microelectronic elements or cells and func- 
tional units including combinations of cells, and a set of 
rules that define the objectives of the design. 

The system 130 decomposes these inputs into a plurality 
of parallel processes that are executed simultaneously using 
individual processing units as will be described in detail 
below. In general, one or more processors coordinate the 
operation of other processors, which are optimized, evalu- 
ated and recombined to produce an optimal cell placement 
which may or may not satisfy a predetermined performance 
objective. 

If the objective is reached, the optimal cell placement that 
was produced by the system 130 is used to generate masks 
for fabrication of the desired integrated circuit chip. If not, 
the initially produced optimal cell placement is fed back to 
the parallel processors which refocus the optimization func- 
tion for improving the placement. 

The integrated circuit physical design automation system 
130 comprises a global operating system 132 that generally 
controls and coordinates the operation of headware 134 and 
simultaneous processing architecture 136. 

The architecture 136 includes a plurality of parallel pro- 
cessors and a memory structure for simultaneously execut- 
ing a plurality of genetic and other algorithms 138 for 
comparing the relative fitnesses of a large number of pos- 
sible cell placements and determining the placement that has 
the highest fitness. Implementation of the algorithms 138 is 
facilitated by a unique cell placement representation 140 and 
cost function or factor computation 142. These elements will 
be described in detail below. 

The architecture 136 can be of any type that enables 
parallel processing in accordance with the method of the 
invention. A DSM arrangement such as described above 
with reference to FIG. 5 is especially suitable for practicing 
the invention since the results produced by the processors 
can be recomposed using shared memory. 

The processors can be identical as illustrated in FIG. 5, or
they can be different. The architecture 136 as illustrated in
FIG. 9 comprises a plurality of parallel processing nodes
144, 146, 148 and 150 and a shared memory 152, each of
which includes a directory based cache coherency unit as
described above with reference to FIG. 5. The nodes 144,
146, 148 and 150 each comprise a processor and a local
memory, and have access to the shared memory 152 and the
memories of all of the nodes via their cache coherency units
and a bus 154.

Each processor 144, 146, 148 and 150 is selected as
having unique characteristics and excelling at different kinds
of tasks. One processor, for example, can operate at very
high speed but be relatively inefficient at handling a variety
of input/output protocols, whereas another processor can
have the opposite characteristics.

In the illustrated example, the nodes 144 and 146 each
comprise a 386 microprocessor that operates at 25 MHz, and
a two megabyte local memory. The node 148 comprises a
486 microprocessor that operates at 60 MHz, and a 4
megabyte local memory. The node 150 comprises a MIPS
R4000 microprocessor that operates at 150 MHz, and two
megabytes of local memory.

Typically, the node 148 will be utilized to control the
nodes 144, 146 and 150 to perform tasks in parallel. The
nodes 144 and 146 will be used for relatively simple tasks,
whereas the node 150 will be used for computationally
intensive tasks.

The method of FIGS. 6 and 7 can be applied in a variety
of ways using the system 130. For example, a single initial
placement can be generated, and different algorithms, such
as genetic alteration and simulated annealing, applied to the
initial placement using respective parallel processors.

The fitnesses of processed placements that result from
applying the different algorithms to the initial placement are
then evaluated, and the processed placement having the
highest fitness is designated as the optimized placement.

The processes can also be monitored, and the processes
and/or cost functions switched during processing in
accordance with a predetermined criterion.

Another aspect of the present method comprises
generating and processing a plurality of initial placements in
parallel using a single algorithm such as simulated annealing
or genetic mutation. Again, the resulting processed
placements are evaluated, and the best placement is selected for
further processing.

A single initial placement can also be generated and
divided into areas or groups of cells, and the parallel
processors used to simultaneously apply optimization
algorithms to the areas or groups. The initial placement can be
divided into contiguous non-overlapping areas, or into
groups of cells in accordance with the netlist or other
hierarchical organization. For example, parallel processors
can be assigned to operate on the nets of the netlist
respectively.

A moving windows feature of the invention as will be
described below is a specific implementation of the present
decomposition and parallel processing method. Each window
delineates a subset of cells, and the subsets are assigned
to respective parallel processors.

A Simultaneous Placement And Routing (SPAR) method
as will be described below is another example of the present
decomposition and parallel processing method. The general
method can be applied to initial placement, global or
detailed routing and/or to simultaneous placement and
routing.

In the SPAR methodology, the areas of high cell
interconnect congestion are identified, and the parallel processors
are used to repeat optimization processing of the congested
areas. The moving windows feature can be combined with
the SPAR system to delineate the areas for reprocessing.

A chaotic fitness improvement method is another form of
the present method, in which cells are relocated in parallel,
and the altered placement evaluated in terms of fitness. If the
fitness has not been sufficiently improved, the parallel cell
relocation operations are repeated based on new congestion
data.

3. Hierarchical Execution by Asynchronous
Delegation (HEADWARE)

FIGS. 10 to 12 illustrate a "HEADWARE" method of
distributed processing, including a fail-safe mechanism that
makes the system immune to the failure of individual
processors.

Prior art distributed processing schemes suffer from
drawbacks including difficulty in varying the number of
processors, failures or crashes of individual processors, and
difficulty in achieving optimal processor utilization in
various diversified applications.

The present method is applicable to a large class of
problems, in addition to utilization in the present physical
design automation system 130, in which the computation
can be divided into a large number of weakly coupled tasks
each taking a minute or more to calculate and whose results
can be reported in a relatively short message. Examples
include design analysis, global routing, detailed routing, test
sequence generation, etc.

In general, a master or host process, which can be referred
to as a team leader in the headware concept, is first started.
The team leader assigns tasks to worker processes and
collects results. The present method uses very little
computer time and can service a large number of worker
processes.

When a worker process is started, it sends a message to
the team leader requesting a task. The team leader then
replies with a message assigning a task and marks the task
as having been assigned. Communication between the team
leader and the worker then ceases, leaving the team leader
free to communicate with other workers.

In accordance with the present invention, it is not
necessary for the team leader to record which worker was
assigned a particular task, or when the task was assigned. An
arbitrary number of workers can request tasks in this
manner, with the team leader assigning each worker a
previously unassigned task.

When a worker completes a task, it resumes
communication with the team leader and identifies the task that it was
assigned, and the results that were obtained from performing
the task. The team leader then records the results, marks the
task as having been completed and assigns the worker
another task. The team leader further preferably saves a copy
of the task list on a computer disk or the like at periodic
intervals as a precaution against failure of the team leader
process.

Eventually, a worker requests a task, and all tasks are
either marked as assigned or completed. If all tasks are
completed, the optimization process is finished and, in the
case of the present physical design automation system 130,
the results are recomposed to produce the optimal cell
placement. If there are tasks that are marked as being
assigned but not completed, the possibility exists that one or
more of the workers to whom the tasks were assigned had a
processor failure, crashed or was shut down to free the
processor for other uses.
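
The team leader bookkeeping described in this section can be sketched in a few lines of Python. This is a minimal, single-process illustration and not the patented implementation: the class name `TeamLeader`, the method `handle_request` and the status strings are hypothetical, and message passing, arbitration and periodic saving of the task list are omitted. Note that, as in the text, the leader never records which worker holds a task; it recovers from worker failure by redesignating assigned tasks as unassigned once no unassigned tasks remain.

```python
UNASSIGNED, ASSIGNED, COMPLETED = "unassigned", "assigned", "completed"


class TeamLeader:
    """Keeps a task list and answers work requests from workers."""

    def __init__(self, task_ids):
        self.status = {t: UNASSIGNED for t in task_ids}
        self.results = {}

    def _next_unassigned(self):
        for task, state in self.status.items():
            if state == UNASSIGNED:
                return task
        return None

    def handle_request(self, finished_task=None, result=None):
        """Handle one work request and return the next task id.

        A request may carry the result of a finished task. Returns None
        only when every task has been completed.
        """
        if finished_task is not None:
            self.results[finished_task] = result
            self.status[finished_task] = COMPLETED
        task = self._next_unassigned()
        if task is None and ASSIGNED in self.status.values():
            # Recovery: assume the holders of these tasks have failed.
            for t, state in self.status.items():
                if state == ASSIGNED:
                    self.status[t] = UNASSIGNED
            task = self._next_unassigned()
        if task is not None:
            self.status[task] = ASSIGNED
        return task

    def done(self):
        return all(s == COMPLETED for s in self.status.values())
```

A short usage trace: with tasks `t1` and `t2`, two workers each receive one task. If the second worker crashes, the first worker's next request (reporting `t1`) finds no unassigned tasks, so `t2` is redesignated and handed to the surviving worker.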
When such an event occurs, all assigned but uncompleted
tasks are redesignated as unassigned, and are reassigned to
other workers as they become available. This reassignment
ensures that all tasks are eventually completed regardless of
processor failures. The present method of distributed
processing allows efficient use of a variable number of
processors that can be added or removed as they become available.

As illustrated in FIG. 10, the simultaneous processing
architecture 136 includes a plurality of processors such as
described above with reference to FIG. 9. In the exemplary
implementation illustrated in the drawing, the processor 148
acts as a team leader or host processor, whereas the
processors 144, 146 and 150 act as worker processors.

A process decomposition and recomposition unit 158
decomposes an optimization process that is to be performed
to produce an optimal cell placement from a population of
initial cell placements into tasks that can be performed
independently. The optimization processes that can be
decomposed and performed using the present distributed
processing method of the invention are not limited to any
particular categorization, and can include simulated
evolution, annealing or mutation, constructive placement,
force directed placement, or any other type of process that
can be decomposed into parallel tasks. The present method
can also be applied to performing two or more complete
optimization processes in parallel using respective
processors.

As illustrated in the flowchart of FIG. 12, the unit 158
decomposes the optimization process to be performed into
tasks, which are downloaded by the team leader processor
148. The control process that is performed by the processor
148 is then initiated, as well as worker processes that are
performed by the worker processors 144, 146 and 150.

Upon initialization, the team leader processor 148 goes
into a loop in which it looks for a work request from the
worker processors 144, 146 and 150. Upon initialization, the
worker processors 144, 146 and 150 send work requests to
the team leader processor 148. Although not illustrated in
detail, the architecture 136 includes an arbitration
mechanism that ensures that the team leader processor 148 will
communicate with only one worker processor 144, 146 and
150 at any one time, and that collisions between incoming
work requests are prevented.

The team leader processor 148 stores a task list 160 as
illustrated in FIG. 11 in an appropriate location in memory.
The task list 160 includes an entry for each task that was
downloaded from the unit 158, including an identifier of the

accompanied by a result, the team leader processor 148
stores the result, and assigns the next unassigned task to the
requesting worker processor 144, 146 or 150.

The team leader processor 148 also marks or redesignates
the task that was just completed in the list 160 as completed,
and redesignates the task that was just assigned as being
assigned. If the work request is not accompanied by a result,
as occurs upon process initialization, the team leader
processor 148 assigns the next unassigned task to the requesting
worker processor 144, 146 or 150, and redesignates the task
as being assigned.

The team leader processor 148 also tests to determine if
any tasks remain unassigned. If so, the unassigned tasks are
assigned to the worker processors 144, 146 and 150 in
response to work requests therefrom. If not, the team leader
processor 148 tests to determine if any tasks remain
assigned.

If no assigned tasks are present, then all tasks must
have been completed. When this occurs, the results are
uploaded to the unit 158 for recomposition and generation of
the optimal cell placement, and the distributed processing
operation is terminated.

If no unassigned tasks are present and one or more
assigned tasks remain present, there is an indication that the
worker processors 144, 146 and 150 that were assigned the
remaining assigned tasks have failed, crashed or were
appropriated for another use. However, this does not adversely
affect the operation of the architecture 136. When such a
condition is detected, the team leader processor 148 merely
marks or redesignates the assigned tasks as being
unassigned. The newly unassigned tasks are then assigned to
requesting worker processors 144, 146 and 150 in the
manner described above. The process terminates when the
team leader processor 148 determines that the list 160 does
not include any unassigned or assigned tasks, but only
completed tasks.

The criterion for redesignating assigned tasks as being
unassigned in the process as illustrated in FIG. 12 is that the
list 160 does not include any unassigned tasks, but includes
at least one assigned task. However, the invention is not so
limited, and other criteria can be utilized for causing the
team leader processor 148 to redesignate assigned tasks as
unassigned tasks and therefore compensate for a failure of
one or more of the worker processors 144, 146 and 150. For
example, this operation can be performed if a predetermined
length of time has elapsed after initialization of the process,
or if a predetermined length of time has elapsed after

task (TASK 1, TASK 2 . . . TASK N), and a code that assigned tasks have been previously redesignated as unas- 

indicates the status of the task. For example, code 0 indicates 50 signed tasks. 

that the respective task is unassigned, code 1 indicates that The global operating system 132 and the headware 134 

the task has been assigned but not completed, and code 2 are programs that run on the processors 144, 146, 148 and 

indicates that the task has been completed. 150, The headware 134 is designed to decompose the cell 

Upon receipt of a work request, the team leader processor placement problem into individual tasks that can be run 

148 assigns the next unassigned task in the list 160 to the 55 simultaneously in parallel on the processors 144, 146, 148 

worker processor 144, 146 or 150 that generated the respec- and 150. For example, the genetic algorithm can be run on 

tive work request. The assigned worker processor 144, 146 a plurality of placements using respective processors, and 

or 150 then terminates communication with the team leader the results subsequently compared. The migration operation 

processor 148 and begins to perform the assigned task. The can be utilized in this arrangement as disclosed in the above 

team leader processor 148 does not make any further attempt 60 referenced article to Mohan. 

to communicate with the assigned worker processor 144, The processors 144, 146, 148 and 150 are selectively 
146 or 150 until it receives a subsequent work request utilized to perform the required operations and subopera- 
therefrom. tions for physical design automation. For example, a par- 
After completing an assigned task, each worker processor ticular processor can be used at different times under soft- 
144, 146 and 150 sends a work request to the team leader 65 ware control to function as a bounder for computing 
processor 148 requesting a new task, together with the bounding boxes for cost factor computation, a selector for 
results of the task just completed. If the work request is selecting cells for mutation, a transposition processor for 
performing cell swaps, or might perform all of these operations
for one of the placements being evaluated.

4. Integrated Circuit Cell Placement Representation

The problems described above with reference to FIG. 2 are solved
in accordance with the present invention, enabling genetic
crossover as well as all other genetic transposition or swapping
operations to be performed without modification. This goal is
accomplished by, for example, utilizing the unique integrated
circuit cell placement representation 140 as illustrated in FIG.
8.

Although the present placement representation and transposition
method is especially suited to the integrated circuit cell
placement optimization problem, it is not so limited, and can
generally be applied to any application for representing
permutations of any types of entities.

In accordance with the present invention, a cell placement or
other permutation of entities is not necessarily represented by a
list of locations and corresponding cells as in the prior art, but
is preferably represented by an initial placement or permutation
and a list of transpositions or "swaps" by which the
representation can be derived from the initial representation.

As illustrated in FIG. 13, an initial placement or permutation 162
includes four cell identifiers 2, 3, 4 and 1 assigned to locations
(1) to (4) respectively. The initial placement 162 can be
represented by the list (1)2,(2)3,(3)4,(4)1 in which the numbers
in parentheses represent locations and the bare numbers represent
cell identifiers.

In an actual integrated circuit chip application, there will
typically be more cell locations than cells. In this case, a
number of dummy or "idle" cells are added to increase the number
of cells to be equal to the number of locations. For the purpose
of explaining the principles of the invention, it will be assumed
that the numbers of locations and cells are equal.

The reference numeral 164 designates a placement which is derived
from the placement 162 by a plurality of cell transpositions or
swaps. The swaps by which the placement 164 can be derived from
the placement 162 are not limited to one factorization or set. In
the illustrated example, the placement 164 is derived from the
placement 162 using three sets of swaps.

The arrangement of FIG. 13 can be considered as a tree 166, with
each placement representing a node and each swap representing an
edge that connects two adjacent nodes. The tree 166 has three
branches 168, 170 and 172, representing three sets or lists of
swaps by which the placement 164 can be derived from the placement
162.

The left branch 168 and the right branch 172 each consist of the
required minimum number (N-1) of swaps, in this case 4-1=3, to
represent the placement 164. The center branch 170 consists of
five swaps, which is more than the minimum required number.

More specifically, the placement 164 as represented by the branch
168 consists of the initial placement (1),2;(2),3;(3),4;(4),1 and
a list of transpositions or swaps consisting of the elements
(3),(4);(2),(4);(1),(4). These swaps are location/location swaps.
For example, the swap (3),(4) means that the cells in locations
(3) and (4) are transposed or swapped. This swap produces an
intermediate placement 174 as illustrated in FIG. 13, in which the
cells 4 and 1 that are in locations (3) and (4) in the placement
162 are swapped such that they are in locations (4) and (3)
respectively in the placement 174.

The swap (2),(4) applied to the placement 174 results in swapping
of the cells 3 and 4 to produce an intermediate placement 176,
whereas the swap (1),(4) applied to the placement 176 results in
swapping of the cells 2 and 3 to produce the final placement 164.

Each swap results in the transposition of two cells. No cells are
ever lost or duplicated, but merely moved around. This one-to-one
relationship between swaps and cells accomplishes a goal of the
present invention, in that it enables all genetic operations,
including crossover, to be performed in their basic form with no
possibility of generating illegal placements.

The placement 164 as represented by the branch 172 consists of the
initial placement (1),2;(2),3;(3),4;(4),1 and a list of swaps
consisting of (1),(4);(1),(3);(1),(2) which result in intermediate
placements 178 and 180 and the placement 164 respectively. The
placement 164 as represented by the branch 170 consists of the
initial placement (1),2;(2),3;(3),4;(4),1 and a list of swaps
consisting of (1),(4);(1),(2);(1),(3);(2),(3);(1),(2) which result
in intermediate placements 182, 184, 186 and 188 and the placement
164 respectively.

For N locations, there are a total of N! (N factorial) placements
or permutations of the cells. In the exemplary case of four
locations, a total of 4!=24 placements are possible. However, it
can be proven mathematically that for N locations, each possible
cell placement can be represented by the initial placement and a
maximum of N-1 swaps.

It is within the scope of the invention to represent each
placement by the initial placement and a fixed number of swaps, or
by the initial placement and a variable number of swaps. In the
former case, it is possible and probable for many of the swaps in
the list to be null. The maximum number of swaps which are
actually required to derive a particular placement can be zero
(the initial placement) or any number from 1 to N-1. In addition,
there will be an odd or even number of swaps or parity for each
placement which can be used for error checking purposes.

In the example of FIG. 13, the swaps were specified in
location/location format. It is further within the scope of the
invention to specify swaps in the form of cell/cell, location/cell
and cell/location. All of these formats are supported by
maintaining a table or list of cell locations and the cell
identifiers corresponding to the respective cell locations in an
electronic memory. Thus, if a cell is specified for a swap, the
location in which the cell is assigned can be readily determined.
The format of this table is simply a list of locations and cells.
For example, a table for the placement 164 would consist of the
entries (1),3;(2),4;(3),1;(4),2.

FIG. 14 illustrates a series of cell/cell swaps which derive a
placement 190 from an initial placement 192. The first illustrated
swap is 3,4. Since cell 3 is initially in location (2) and cell 4
is initially in location (3), the cell/cell swap 3,4 is equivalent
to a (2),(3) location/location swap, and produces an intermediate
placement 194. The next swap is 2,4, which is equivalent to a
(1),(2) location/location swap, and produces an intermediate
placement 196. The last swap is 1,4, and produces the placement
190.

It will be noted that the initial placement 192 in FIG. 14 is the
same as the initial placement 162 in FIG. 13, and that the
numerical values of the swaps in FIG. 14 are the same as the
numerical values of the swaps in the left branch 168 in FIG. 13.
However, the placements 164 and 190 that are produced by these
swaps are different.

An example of a location/cell swap is illustrated in FIG. 15, and
utilizes an initial placement 198 that is the same as in the
examples of FIGS. 13 and 14, and the same numerical values for the
swaps. The first swap is (3),4, which indicates that whatever cell
is in location (3) should be swapped with cell 4. However, in this
example, cell 4 is already in location (3). Thus, the first swap
produces an intermediate placement 200 that is the same as the
initial placement 198.

The next swap is (2),4, which indicates that whatever cell
is in location (2) should be swapped with cell 4. In this case,
cell 3 is in location (2), and is swapped with cell 4 that is in
location (3). The (2),4 location/cell swap is equivalent to a
(2),(3) location/location swap, and produces an intermediate
placement 202. The last swap is (1),4, and produces a
placement 204 that is different from the placements 164 and
190 of FIGS. 13 and 14 respectively.

FIG. 16 illustrates an example of a cell/location swap 
sequence, using the same initial placement, here designated 
as 206, and numerical swap values as in the previous 
examples. The first swap is 3,(4), which indicates that cell 3 
should be swapped for whatever cell is in location (4). Since 
cell 3 is in location (2), the 3,(4) cell/location swap is
equivalent to a (2),(4) location/location swap, and produces 
an intermediate placement 208. A 2,(4) swap produces an 
intermediate placement 210 in the same manner, whereas a 
1,(4) swap produces a placement 212. It will be noted that 
the placement 212 is different from the results of the 
previous examples. An application of cell/location swap is 
presented in FIGS. 34a and 34b. 
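
The four swap formats of FIGS. 13 to 16 can be reproduced with a short sketch. It assumes, as the text suggests, a location-to-cell table plus a reverse cell-to-location index; the class name `Placement`, its method names and the helper `run` are hypothetical and are not taken from the specification.

```python
class Placement:
    """Location -> cell table plus a reverse cell -> location index."""

    def __init__(self, cells):
        # cells[i] is the cell identifier assigned to location i + 1
        self.cells = list(cells)
        self.loc_of = {c: i + 1 for i, c in enumerate(self.cells)}

    def swap_loc_loc(self, a, b):
        # Transpose whatever cells occupy locations a and b
        ca, cb = self.cells[a - 1], self.cells[b - 1]
        self.cells[a - 1], self.cells[b - 1] = cb, ca
        self.loc_of[ca], self.loc_of[cb] = b, a

    def swap_cell_cell(self, x, y):
        # Resolve both cells to their current locations first
        self.swap_loc_loc(self.loc_of[x], self.loc_of[y])

    def swap_loc_cell(self, loc, cell):
        self.swap_loc_loc(loc, self.loc_of[cell])

    def swap_cell_loc(self, cell, loc):
        self.swap_loc_loc(self.loc_of[cell], loc)


def run(method_name, pairs, initial=(2, 3, 4, 1)):
    """Apply the same numerical swap values under one swap format."""
    p = Placement(initial)
    for a, b in pairs:
        getattr(p, method_name)(a, b)
    return p.cells


PAIRS = [(3, 4), (2, 4), (1, 4)]          # same values in all four figures
print(run("swap_loc_loc", PAIRS))         # FIG. 13, left branch
print(run("swap_cell_cell", PAIRS))       # FIG. 14
print(run("swap_loc_cell", PAIRS))        # FIG. 15
print(run("swap_cell_loc", PAIRS))        # FIG. 16
```

Each call starts from the same initial placement and uses the same numerical values, yet the four formats end in four different placements, which is the point made by the examples above.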

Since the four types of swaps produce different results, it 
is possible that switching from one type of swap to another 
could increase the convergence rate in a particular cell 
placement application. It is therefore desirable to provide a 
convenient mechanism by which the desired type of swap 
can be designated and executed. An example of such a 
system is presented in the following table. 

                          TABLE

OPERATOR  OPERATION               OPERAND 1   OPERAND 2
1         Location/Location Swap  Loc 1       Loc 2
2         Cell/Cell Swap          Cell 1      Cell 2
3         Location/Cell Swap      Loc 1       Cell 1
4         Cell/Location Swap      Cell 1      Loc 2
5         Row Swap                Row 1       Row 2
6         Column Swap             Col 1       Col 2
7         Roll Up                 Start Row   End Row
8         Roll Down               Start Row   End Row
9         Roll Right              Start Col   End Col
10        Roll Left               Start Col   End Col
11        Move Block              Start Loc   End Loc
12        Rotate Block CW         Start Loc
13        Rotate Block CCW        Start Loc
14        Invert                  Start Loc   End Loc



Each swap operation can be designated by an operator and 
one or two operands. The operator for a location/location 
swap is 1. To specify the location/location swap (3),(4), a 
command to the processor would be 1,3,4, in which the 
operands are 3 and 4. Although the operands are not 
enclosed in parentheses, the system knows that they are to be
considered as locations rather than cells because the operator 
designates a location/location swap. 

In an essentially similar manner, a 3,4 cell/cell swap
would be designated as 2,3,4, a (3),4 location/cell swap
would be designated as 3,3,4 and a 3,(4) cell/location swap
would be designated as 4,3,4.
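
As a minimal sketch of how such commands might be decoded, the function below parses an "operator,operand,operand" string and resolves the operands according to operators 1 to 4 of the table. The function name `execute` and the use of a plain Python list for the placement are assumptions made for illustration only.

```python
def execute(command, cells):
    """Apply one 'operator,operand1,operand2' command in place.

    cells[i] holds the cell identifier in location i + 1. Only the
    single-cell swap operators 1 to 4 of the table are handled here.
    """
    op, x, y = (int(field) for field in command.split(","))

    def loc_of(cell):
        # Look up the location currently holding the given cell
        return cells.index(cell) + 1

    if op == 1:            # location/location
        a, b = x, y
    elif op == 2:          # cell/cell
        a, b = loc_of(x), loc_of(y)
    elif op == 3:          # location/cell
        a, b = x, loc_of(y)
    elif op == 4:          # cell/location
        a, b = loc_of(x), y
    else:
        raise ValueError(f"operator {op} is not handled in this sketch")

    cells[a - 1], cells[b - 1] = cells[b - 1], cells[a - 1]
    return cells


for cmd in ("1,3,4", "2,3,4", "3,3,4", "4,3,4"):
    print(cmd, execute(cmd, [2, 3, 4, 1]))
```

The operator alone decides whether an operand is read as a location or as a cell, which is why the operands need no parentheses in the command.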

The single cell swaps may be used in genetic mutation
operations such as simulated annealing. However, the
present representation and transposition method is not so
limited, and can be advantageously utilized to perform
swaps of entities consisting of two or more cells.

FIG. 17 illustrates how rows 1 and 3 can be transposed or
swapped in response to the command 5,1,3, where 5 is the
operator for a row swap, and 1 and 3 are the operands
indicating the rows to be swapped. The row swap is executed
as a series of (1),(7);(2),(8);(3),(9) location/location swaps.
However, it will be understood that the operation could 
alternatively be performed using cell/cell swaps, location/ 
cell swaps, cell/location swaps or a combination thereof. 

FIG. 18 illustrates how columns 1 and 2 can be swapped
in response to a command 6,1,2, where 6 is the operator for
a column swap, and 1 and 2 are the operands indicating the
columns to be swapped. The operation is performed using
the location/location swaps (1),(2);(4),(5);(7),(8).
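
The expansion of row and column swaps into location/location swaps can be sketched as follows. The sketch assumes that locations are numbered row by row starting from (1), which is what the swap lists quoted for FIGS. 17 and 18 imply for a grid three columns wide; the function names are hypothetical.

```python
def row_swap(r1, r2, n_cols):
    """Location/location swaps that exchange rows r1 and r2."""
    return [((r1 - 1) * n_cols + c, (r2 - 1) * n_cols + c)
            for c in range(1, n_cols + 1)]


def column_swap(c1, c2, n_rows, n_cols):
    """Location/location swaps that exchange columns c1 and c2."""
    return [(r * n_cols + c1, r * n_cols + c2) for r in range(n_rows)]


print(row_swap(1, 3, n_cols=3))               # command 5,1,3
print(column_swap(1, 2, n_rows=3, n_cols=3))  # command 6,1,2
```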

Another exemplary transposition operation is illustrated 
in FIG. 19, and consists of rolling rows 1 to 3 upwardly such 
that the original row 1 is wrapped down around to row 3. 
The command is 7,1,3, where 7 is the operator for roll up,
1 is the upper or start row and 3 is the lower or end row. The 
individual swaps are listed in the drawing. A roll down 
operation, which is not illustrated, is executed in an essen- 
tially similar manner in response to the operator 8. 

FIG. 20 illustrates a roll right operation which is similar 
to the roll up operation, except that it is performed on 
columns rather than rows. The illustrated operation is per- 
formed in response to a command 9,2,4, where 9 is the 
operator for roll right, 2 is the start column and 4 is the end 
column. The individual swaps are listed in the drawing. A 
roll left operation which is executed in response to the 
operator 10 is essentially similar. 
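
The individual swaps for the roll operations appear only in the drawings, so the following is one possible decomposition rather than the one illustrated: a roll up is built from swaps of adjacent rows, and a roll down is the same swap list in reverse order. Row-by-row location numbering and all function names are assumptions.

```python
def apply_swaps(cells, swaps):
    """Return a copy of cells after applying location/location swaps."""
    cells = list(cells)
    for a, b in swaps:
        cells[a - 1], cells[b - 1] = cells[b - 1], cells[a - 1]
    return cells


def roll_up(start_row, end_row, n_cols):
    """Swap each row with the one below it, top to bottom.

    The original start row ends up wrapped around to end_row.
    """
    swaps = []
    for r in range(start_row, end_row):
        for c in range(1, n_cols + 1):
            swaps.append(((r - 1) * n_cols + c, r * n_cols + c))
    return swaps


def roll_down(start_row, end_row, n_cols):
    """Inverse of roll_up: the same swaps applied in reverse order."""
    return list(reversed(roll_up(start_row, end_row, n_cols)))


grid = list(range(1, 10))                     # 3 x 3 grid, cells 1..9
print(apply_swaps(grid, roll_up(1, 3, 3)))    # command 7,1,3
print(apply_swaps(grid, roll_down(1, 3, 3)))  # command 8,1,3
```

A roll right or roll left can be obtained in the same way by exchanging the roles of rows and columns.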

In addition to storing cell locations and the corresponding 
cell identifiers, block identifiers or tags can be stored in 
memory. Each cell of a contiguous block of cells which is to 
be considered as a unit is given a block identifier. Each time 
a cell is designated as an operand in a transposition 
command, the block identifiers are checked to determine if 
the command designates all cells in the block to be trans- 
posed together. If not, the command is rejected or modified. 
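
The block identifier check can be sketched as a simple set comparison. The mapping `block_of` from cell identifier to block tag and the function name `command_allowed` are hypothetical; the sketch only decides whether a command should be accepted and leaves rejection or modification to the caller.

```python
def command_allowed(moved_cells, block_of):
    """Return True if every block touched by the command moves as a whole.

    moved_cells: cell identifiers designated by a transposition command.
    block_of:    dict mapping a cell identifier to its block tag; cells
                 that belong to no block are simply absent.
    """
    moved = set(moved_cells)
    touched = {block_of[c] for c in moved if c in block_of}
    for block in touched:
        members = {c for c, b in block_of.items() if b == block}
        if not members <= moved:
            return False          # command would break up this block
    return True


block_of = {7: "A", 6: "A", 3: "A"}           # one three-cell block
print(command_allowed([7, 6, 3], block_of))   # whole block designated
print(command_allowed([7, 6], block_of))      # block would be split
print(command_allowed([1, 2], block_of))      # no block involved
```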

This enables cell blocks that constitute integral logic 
elements to be transposed around the placement, but pre- 
vents the block from being broken up. It is further within the 
scope of the invention to designate whether or not the 
orientation of a particular block is critical. If not, the blocks 
can be rotated, inverted or subjected to other operations that 
can vary their orientation. If the orientation is critical, the 
blocks can be transposed but prevented from having their 
orientation changed. 

The present cell representation system is not limited to a 
two dimensional representation of chip placements. For 
example, in a multilevel integrated chip, the present system 
can be extended to represent three dimensional representa- 
tions. The present invention is, in fact, unlimited in the 
number of dimensions that can be represented. 

FIG. 21 illustrates a move block operation, in which an
irregular or L-shaped block consisting of cells 7,6,3 is
moved without change in orientation. The command is
11,10,3, in which 11 is the operator for move block, 10 is the
start location of the cell in the first location of the block and
3 is the end location for the cell in the first location of the
block. It will be noted that the cells 14,13,10 which were
originally in the new location of the block 7,6,3 are trans-
posed to the original locations of the block.

FIG. 22 illustrates how a block can be rotated clockwise
in response to a command 12,6, in which 12 is the operator
for rotate block clockwise and 6 is the location of the cell in
the first location of the block. A rotate block
counterclockwise operation, similar to the rotate block
clockwise operation shown in FIG. 22, is performed in
response to the operator 13.

FIG. 23 illustrates an invert operation which is executed 
in response to the command 14,9,12, in which 14 is the 
operator for invert, 9 is the location of the first cell in the 
series to be inverted, and 12 is the location of the last cell in the series. It
will be noted that if the number of cells to be inverted is odd, 
the cell in the middle will be unchanged. Other types and 
forms of operands, especially with more than two 
dimensions, can be used and are considered and conceived 
as part of the present invention. 
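
One way to express the invert operation as location/location swaps is to pair the outermost locations and work inward, as in the sketch below. The function name `invert` is hypothetical, and the decomposition is an assumption, since the text does not list the individual swaps.

```python
def invert(start_loc, end_loc):
    """Location/location swaps that reverse the series start_loc..end_loc."""
    swaps = []
    while start_loc < end_loc:
        swaps.append((start_loc, end_loc))
        start_loc += 1
        end_loc -= 1
    return swaps


print(invert(9, 12))   # command 14,9,12: even count, every cell moves
print(invert(9, 13))   # odd count: location 11 is never touched
```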

Since certain operations produce faster convergence 
depending on a particular application, it is desirable to know 
not only the results of a particular operation, but the manner 
in which the operation was performed. For example, cell/ 
location swaps may produce faster convergence than 
location/location swaps, in certain types of problems, even 
though the same placements can be generated by each type 
of swap. This enables an evaluation of the relative effec- 
tiveness of each operation in a particular environment, and 
utilization of the type of operations, or combination of 
operations, which produces the best results. 

It is therefore desirable to provide a history list of the 
operations, as well as the results of the operations. The list 
can be generated automatically as the operations are per- 
formed by simply storing the commands. For example, a 
history list for the operations of FIGS. 17 to 22 would
consist of the entries 5,1,3;6,1,2;7,1,3;9,2,4;11,10,3;12,6 as 
described above. 
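
A history list of this kind amounts to appending each command as it is executed. The class below is a minimal sketch; the name `History` and its methods are hypothetical, and the text format simply mirrors the semicolon-separated entries quoted above.

```python
class History:
    """Records transposition commands in the order they are executed."""

    def __init__(self):
        self.entries = []

    def record(self, operator, *operands):
        # Store the command exactly as issued: operator, then operands
        self.entries.append((operator, *operands))

    def as_text(self):
        return ";".join(",".join(str(v) for v in e) for e in self.entries)


history = History()
history.record(5, 1, 3)     # row swap
history.record(6, 1, 2)     # column swap
history.record(7, 1, 3)     # roll up
history.record(9, 2, 4)     # roll right
history.record(11, 10, 3)   # move block
history.record(12, 6)       # rotate block clockwise
print(history.as_text())
```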

FIG. 24 illustrates a genetic crossover operation that can 
be performed without modification using the cell placement 
and transposition method of the present invention, thereby 
accomplishing the goals described above. The exemplary 
operation illustrated in the drawing begins with providing a 
first parent placement 214 that is represented by an initial 
placement 216 that consists of (1)2;(2)3;(3)4;(4)1 and a 
swap list (3),(4);(2),(4);(1),(4). A second parent placement 
218 is represented by an initial placement 220 that consists 
of (1)3;(2)2;(3)1;(4)4 and a swap list (1),(3);(1),(2);(2),(4). 

It is desired to perform a crossover operation by which the 
first swap (3),(4) in the swap list of the first parent placement 
214 is transposed or swapped with the third swap in the swap 
list of the second parent placement 218. 

This produces a first child placement 222 that is repre- 
sented by the initial placement 216 and a swap list (2),(4); 
(2),(4);(1),(4). It will be noted that the first and second swaps 
in the swap list are identical, with the second swap reversing 
the first swap. However, it is important to understand that 
although the swaps were duplicated, no cells were dupli- 
cated or omitted. 

The crossover operation further produces a second child 
placement 224 that is represented by the initial placement 
220 and a swap list (1),(3);(1),(2);(3),(4).

In summary, the present cell placement representation as 
designated at 140 in FIG. 8 enables any type of genetic 
alteration or operation, including genetic crossover, to be 
performed on one or more cell placements, with no cells 
being duplicated or omitted, and all resulting placements 
being legal. 
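
The crossover of FIG. 24 can be checked with a few lines of code. The sketch below exchanges one swap between the two parent swap lists and then derives both children from their initial placements; the function names are hypothetical.

```python
def apply_swaps(initial, swaps):
    """Derive a placement from an initial placement and a swap list."""
    cells = list(initial)
    for a, b in swaps:
        cells[a - 1], cells[b - 1] = cells[b - 1], cells[a - 1]
    return cells


def crossover(swaps_a, i, swaps_b, j):
    """Exchange entry i of the first swap list with entry j of the second."""
    child_a, child_b = list(swaps_a), list(swaps_b)
    child_a[i], child_b[j] = swaps_b[j], swaps_a[i]
    return child_a, child_b


parent1 = ([2, 3, 4, 1], [(3, 4), (2, 4), (1, 4)])   # placement 214
parent2 = ([3, 2, 1, 4], [(1, 3), (1, 2), (2, 4)])   # placement 218

# First swap of parent 1 exchanged with third swap of parent 2
swaps1, swaps2 = crossover(parent1[1], 0, parent2[1], 2)
child1 = apply_swaps(parent1[0], swaps1)             # placement 222
child2 = apply_swaps(parent2[0], swaps2)             # placement 224
print(swaps1, child1)
print(swaps2, child2)
```

Because only the swap lists are recombined, each child is still a permutation of the same four cells, whatever positions are chosen for the exchange.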

5. Congestion Based Cost Function Computation 

The fitness of a particular placement is evaluated in
accordance with the present invention using the unique cost
factor computation as designated at 142 in FIG. 8 based on
the interconnect congestion of the placement, which pro-
vides a much more accurate evaluation than the conven-
tional methods of total interconnect wire length and maximum
interconnect path length. Although congestion can be
measured accurately by performing at least a global routing
after placement, this is extremely time consuming, and
impractical where a very large number of placements must
be evaluated. It will be recalled that cost can be considered
as the inverse of fitness.

The present method is based on the novel realization that 
the interconnect congestion in a placement is directly related 
to the amount of overlap of bounding boxes that can be 
defined for the individual nets of the placement. 

As illustrated in FIG. 25, a placement 226 of cells 228 is
divided into "tiles" or "switch boxes" 230 that surround the
cells 228 respectively. Bounding boxes are then defined
around the respective nets specified by the netlist for the
placement 226 with a detour factor δ provided around the
perimeter in the manner described above with reference to
FIGS. 3 and 4.

In the example of FIG. 26, a net 232 interconnects
terminals 234, 236, 238, 240 and 242 of cells 228a, 228b,
228c, 228d and 228e respectively. A bounding box 243 is
defined around these terminals. It will be noted that the
bounding box 243 at least partially overlaps a group of the
switch boxes 230, which are designated by letter suffixes
beginning with 230a.

In accordance with the basic principle of the present cost
factor computation, an individual congestion factor is com-
puted for each switch box 230 as being equal to the number
of bounding boxes that overlap, or at least partially overlap,
the respective switch box. Since each of the designated switch
boxes 230 is overlapped by one bounding box 243 in FIG. 26, the
congestion factor for each of these switch boxes is one, and
the congestion factor for each of the other illustrated switch
boxes is zero.

The principle of the invention is further illustrated in FIG.
27, in which a placement 245 is divided into switch boxes
244 that enclose cell locations 246. Several bounding boxes
are illustrated as enclosing individual nets of a netlist for the
placement 245, but the nets themselves are not shown in
order to avoid cluttering of the drawing.

A bounding box 248 is illustrated as at least partially
overlapping switch boxes 244a, 244b, 244d and 244e. A
bounding box 250 similarly overlaps switch boxes 244c,
244d, 244e, 244f, 244g, 244h, 244i, 244j and 244k. Another
bounding box 252 overlaps switch boxes 244c, 244d, 244e,
244f, 244g, 244h, 244i, 244j and 244k. The areas in which
two bounding boxes overlap switch boxes are designated by
rightwardly slanting hatching, whereas an area indicated by
an arrow 254 in which three bounding boxes overlap a
switch box is designated by leftwardly slanting hatching.

The switch boxes 244a and 244b are overlapped by only
the bounding box 248, and the congestion factor thereof is
one. The switch boxes 244c, 244f, 244g, 244h, 244i, 244j
and 244k are overlapped by the bounding boxes 250 and
252, and the congestion factor thereof is two. The switch
boxes 244d and 244e are overlapped by the bounding boxes
248, 250 and 252 in the area indicated by the arrow 254, and
the congestion factor thereof is three.
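
The counting rule described above can be sketched for a rectangular array of switch boxes. The sketch assumes that each bounding box has already been converted to the inclusive range of switch box rows and columns that it at least partially overlaps; the function name and the sample geometry are hypothetical and do not reproduce FIG. 27.

```python
def congestion_factors(n_rows, n_cols, boxes):
    """Count, for every switch box, the bounding boxes overlapping it.

    boxes: iterable of (row_min, col_min, row_max, col_max) tuples giving
           the inclusive, zero-based range of switch boxes that a
           bounding box at least partially overlaps.
    """
    grid = [[0] * n_cols for _ in range(n_rows)]
    for row_min, col_min, row_max, col_max in boxes:
        for r in range(row_min, row_max + 1):
            for c in range(col_min, col_max + 1):
                grid[r][c] += 1
    return grid


boxes = [
    (0, 0, 1, 1),   # first net
    (1, 1, 2, 3),   # second net
    (1, 1, 2, 2),   # third net, nested inside the second
]
for row in congestion_factors(3, 4, boxes):
    print(row)
```

Switch boxes covered by one, two or three of the sample bounding boxes receive congestion factors of one, two or three respectively, and uncovered switch boxes remain at zero.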

The cost factor for a placement is computed by performing a
mathematical operation on the individual congestion factors of the
switch boxes. For example, the cost factor can be defined as the
maximum or average value of the congestion factors. However, a
more accurate estimation of the actual congestion of a placement
can be obtained using more sophisticated operations. For example,
the cost factor can be preferably defined as the sum of the
squares of the individual congestion factors. Another operation
that can be advantageously employed is to define the cost factor
as the "soft maximum" of the individual congestion factors, which
is defined as

$$f_c(P) = \frac{1}{\alpha} \log\left(\sum_{i=1}^{M} e^{\alpha c_i}\right)$$

where f_c(P) is the cost function for placement P, M is the number
of switch boxes, i is a counter from 1 to M, c_i is the congestion
factor for a switch box i, and α is a variable or constant which
is selected in accordance with a particular application.

It is further within the scope of the invention to combine the
congestion based cost function f_c(P) with one or more other
fitness or cost estimations, for example the total wire length
estimation obtained using the half-perimeter method described with
reference to FIGS. 3 and 4. Other cost estimations that can be
combined with the cost function f_c(P) include, but are not
limited to, maximum path length, channel capacity overflow and row
and/or column length.

The individual components of a composite cost function can also be
weighted, for example

$$CF = \alpha f_c(P) + \beta f_{wl}(P) + \gamma f_p(P) + \lambda f_o(P)$$

where CF is the composite cost function, f_c is the present
congestion based cost function, f_wl is the estimated total wire
length, f_p is the estimated maximum path length, f_o is a
predetermined overflow factor, and α, β, γ and λ are
proportionality constants that constitute weighting factors.

Various modifications are possible to the present methods, for
example, setting the congestion factor for a switch box equal to
the number of bounding boxes that overlap the switch box only if a
terminal of one of the associated nets overlaps or is within a
predetermined distance of the switch box. This provides a more
accurate estimation of congestion in placements including
significant numbers of idle cells, since the idle cells will not
have any interconnections. Assigning non-zero congestion factors
to idle cells would produce an erroneously high indication of
congestion.

FIG. 28 illustrates the implementation of this modification to the
basic method. A placement 256 includes switch boxes 258 that
surround cells 260. A first net 262 interconnects terminals 264
and 266 of cells 260a and 260b respectively and is surrounded by a
bounding box 268. Another net 270 interconnects terminals 272 and
274 of cells 260c and 260d respectively, and is enclosed by a
bounding box 276.

The bounding box 268 overlaps switch boxes 258a to 258/, 258k to
258m and 258p to 258r. The bounding box 276 overlaps switch boxes
258j to 258*. Both bounding boxes 268 and 276 overlap switch boxes
258k to 258m and 258p to 258r.

In one modified form of the invention, a non-zero congestion
factor is computed for a switch box only if a terminal of a net
overlaps the switch box. In the example of FIG. 28, only the
switch boxes 258a, 258j, 258r and 258* which are overlapped by the
terminals 264, 272, 266 and 274 respectively will have non-zero
congestion factors. Other congestion (criteria) factors may also
be used and are contemplated.

Returning to the other embodiment, since the switch box 258a is
overlapped by only the bounding box 268, its congestion factor is
one. The switch box 258j is overlapped by only the bounding box
276, and its congestion factor is also one. The switch box 258r is
overlapped by both bounding boxes 268 and 276, and its congestion
factor is two. The switch box 258* is overlapped by only the
bounding box 276, and its congestion factor is one. The congestion
factors of all other switch boxes 258, even if they are overlapped
by one or more bounding boxes, are zero.

It is further within the scope of the invention to modify the
method such that a switch box can have a non-zero congestion
factor only if it is overlapped by at least one bounding box and
is within a predetermined distance of a terminal. As illustrated,
circles 278, 280, 282 and 284 having a predetermined radius are
defined around the terminals 264, 272, 266 and 274 respectively.
The congestion factor of any switch box 258 that is overlapped by
a circle will be computed in the same manner as a switch box 258
that is overlapped by a terminal.

In the illustrated example, the switch box 258d will have a
congestion factor of one since it is overlapped by the bounding
box 268 and the circle 278. The switch box 258* will have a
congestion factor of one since it is overlapped [...] circle 280.
The switch box 258s will have a congestion factor of one since it
is overlapped by the bounding box 276 and the circle [...],
whereas the switch box 258* will have [...].

These modifications can also be weighted in different ways within
the scope of the invention. For example, a switch box that is
overlapped by at least one bounding box but is not overlapped by a
terminal or circle can have a non-zero congestion factor that is
weighted lower than if the switch box was overlapped by a terminal
or circle. It is further possible to provide different weightings
for switch boxes that are overlapped by terminals and circles
respectively. The manner in which the weightings are applied is
not limited within the scope of the invention.

6. Improved Genetic Algorithms for Physical Design Automation

a. Basic Algorithms

The basic genetic algorithm, which is advantageously modified in
accordance with the present invention as will be described below
for application to integrated circuit physical design, is
illustrated in the form of a flowchart in FIG. 29. The basic
genetic algorithm includes the genetic operations of reproduction,
crossover and mutation.

In the first step of the algorithm, the number of generations to
be produced, designated as G, is initialized to zero. Then, an
initial population of M representations is randomly created. This
is necessary because the possible number of placements of N cells
is N!, and for an integrated circuit chip including hundreds of
thousands of cells N! will be such a huge number that the amount
of data representing all of the possible placements could not
easily or reasonably be processed using existing computer
technology.

Next, the following substeps are iteratively performed on the
population of placements until a predetermined termination
criterion has been satisfied.

(a) Evaluate the fitness of each placement of the population.
(b) Create a new population of placements by applying
the following three operations. The operations are
applied to individual placements in the population
chosen with a probability based on fitness.
a. Copy existing individual placements to the new
population (genetic reproduction).
b. Create two new placements by genetically recombining
randomly chosen schema from two existing
placements (genetic crossover).
c. Create a new placement from an existing placement by randomly
transposing cells in the placement (genetic mutation).

(c) The best individual placement that appeared in the last generation
(i.e. the best-so-far individual) is designated as the result of the
genetic algorithm.

The relative instances in which the three genetic operations will be
performed are specified by a reproduction rate P_r, a crossover rate
P_c and a mutation rate P_m. For example, given an initial population
of M=1,000 placements and a reproduction rate P_r of 10%, 100 of the
initial 1,000 placements will be copied into the new generation
without alteration.

The crossover and/or mutation operations are performed to create the
remaining 900 placements. Assuming a crossover rate P_c of 60% and a
mutation rate P_m of 30%, 600 of the placements will be generated by
selecting 300 pairs of parents, and each pair will be genetically
mated to produce 600 children or offspring. 300 of the initial
placements will be subjected to mutation. Thus, the new generation
will have the same number of placements as the initial population; 100
of which were reproduced without alteration, 600 of which were created
by crossover and 300 of which were subjected to mutation.

As described in a textbook entitled "GENETIC PROGRAMMING", by John
Koza, MIT Press, Cambridge, Mass. 1993, pp. 94-101, placements are
selected for genetic alteration on the basis of fitness such that
placements with higher fitness have a higher probability of being
selected. However, it is desirable for less fit placements to be
included to prevent the loss of potentially desirable genetic material
and premature convergence to local optima.

One common selection criterion is "fitness proportionate selection",
in which the probability of a placement being selected is linearly
proportional to its fitness. A variation of fitness proportionate
selection is "rank selection", in which the selection is linearly
proportional to the relative ranking of placements in the population.

A third selection criterion is "greedy overselection", in which the
placements are ordered by fitness and divided into two or more groups
based on fitness. A larger number of placements are selected from the
fittest groups than from the less fit groups.

A "greedy mutation" algorithm is described on page 173 of the above
referenced textbook to Sherwani. In this algorithm, a cell is selected
at random, and the program searches the cells in the same net to find
the cell that is farthest from the randomly selected cell. The
farthest cell is then transposed to a location adjacent to the
randomly selected cell, and the cell in that location is pushed
outwardly until a vacancy is found.

These prior art selection and mutation methods are limited in
effectiveness as they do not address the cost factors of the
individual cells in the placements. The present inventors have
discovered that the convergence rate of the genetic algorithm can be
substantially increased using the unique methods of the invention as
described below.

b. Statistical Selection

A statistical method of selecting placements for crossover in
accordance with the invention is illustrated in FIGS. 30 to 36. FIG.
30 illustrates the first step of the method, in which the individual
placements are sorted or ranked in terms of increasing fitness
(decreasing cost). The individual placement fitness f(i) is generally
proportional to the placement rank, where i is the placement rank in
increasing order of fitness.

In the next step, the fitness f(i) of each placement is multiplied by
a weighting factor ξ^i that increases non-linearly with the placement
rank i. Preferably, ξ is a constant having the value ξ=M/(M-1), where
M is the number of placements. The result is a weighted fitness V(i)
for each placement having the value V(i)=f(i)ξ^i, such that the
weighted fitness V(i) increases non-linearly with placement rank i as
illustrated in FIG. 31.

FIG. 32 illustrates how a weighted fitness summation S(i) is computed
for each placement as being equal to S(i)=ΣV(i). In other words, the
summation S(i) is equal to the algebraic sum of the weighted fitness
of the respective placement and the weighted fitnesses of all
placements having lower fitness.

A placement is selected by generating a random number K between zero
and a maximum value T, where T is equal to the weighted fitness
summation of the placement having the highest fitness. The random
number K is generated by first generating a random number K* having a
value between 0 and 1, and then multiplying K* by T such that K=K*T.

As illustrated in FIG. 32, K is greater than the weighted fitness
summation V3 for the third worst placement and less than the weighted
fitness summation V4 for the fourth worst placement. Thus, the
placement having the summation V4 is selected for the crossover
operation. It will be understood that essentially similar results can
be obtained by selecting the placement having a weighted fitness
summation that is closest to the random number K.

Where the method of FIGS. 30 to 32 is applied to genetic crossover, it
is applied twice to select two placements for crossover. Although the
actual selection is random, the probability of a placement being
selected increases non-linearly with its fitness or rank. This is
evident from FIG. 31, in which the weighted fitnesses and thereby the
fitness summations ΣV(i) increase non-linearly with rank, whereas the
random number K is generated linearly.

It will be understood that although the method of FIGS. 30 to 32 is
especially suited for selecting two placements for genetic crossover,
it can be used in any other environment in which it is required to
select an entity from a ranked set such that the probability of
selection increases non-linearly with rank. An exemplary alternative
application is the selection of placements for genetic operations.

EXAMPLE

FIGS. 33 to 36 illustrate the results of a computer simulated uniform
crossover operation utilizing the method described with reference to
FIGS. 30 to 32. In uniform crossover, cell transpositions or swaps are
made such that the cells in the same locations in two placements are
transposed. The proportion of locations to be transposed versus the
proportion of locations to be unchanged is determined by a ratio,
which in the present case is 50%. The locations for transposition are
selected randomly in accordance with the ratio.

In the flowchart of FIG. 33, a location counter is initialized to
zero, and a random number R having a value between 0 and 100 is
generated. If the number R is less than 50, the cells in the locations
in the parent placements corresponding to the number in the location
counter (initially 0) are unchanged. If the number R is greater than
or equal to 50, the cells in the specified location are transposed or
swapped. The swap is advantageously performed using the cell/location
representation method described above with reference to FIG. 16. The
location counter is then incremented and the operation loops back such
that another random number is generated and a decision is made whether
or not to perform transposition for the next cell location in the
placements. The process ends when the last cell location has been
subjected to reproduction (unaltered) or crossover.
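
The statistical selection of FIGS. 30 to 32 and the per-location decision of FIG. 33 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the weighting factor is the constant M/(M-1) raised to the placement rank (the formula is partly illegible in this text), the function names are hypothetical, and the swap itself is omitted because the cell/location representation of FIG. 16 is not reproduced here.

```python
import random


def weighted_rank_select(fitnesses, rng):
    """Return the index of one selected placement.

    `fitnesses` must be sorted in increasing order of fitness, so that
    index 0 has rank 1 (least fit). Assumes at least two placements.
    """
    m = len(fitnesses)
    xi = m / (m - 1)                    # assumed weighting base M/(M-1)
    running = 0.0
    summations = []
    for rank, fitness in enumerate(fitnesses, start=1):
        running += fitness * xi ** rank   # weighted fitness V(i)
        summations.append(running)        # summation S(i)
    k = rng.random() * summations[-1]     # K = K* x T, with T = S(M)
    for index, s in enumerate(summations):
        if k <= s:                        # first summation that reaches K
            return index
    return m - 1


def transposition_mask(n_locations, rng, ratio=50):
    """Decide, location by location, whether the parents' cells are swapped.

    A random number R in [0, 100) is drawn per location; the location is
    transposed when R >= ratio and left unchanged otherwise.
    """
    return [rng.uniform(0, 100) >= ratio for _ in range(n_locations)]
```

Calling `weighted_rank_select` twice with `random.Random()` yields the two parents for one crossover, and `transposition_mask` then marks which of their locations would be exchanged.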
FIGS. 34a and 34b in combination represent the results of the
simulation. The parameters used for the simulation were cell/location
swaplist representation, statistical selection (as described with
reference to FIGS. 30 to 32), 0% mutation rate, 10% reproduction rate,
90% crossover rate (uniform crossover), 1,000 placement population
size and 100 cells/placement arranged in a 10x10 cell grid.

Each generation is represented by the notation "step" in the printout,
whereas "mcmcsnls" indicates that a genetic crossover operation has
been performed for the generation in accordance with the flowchart of
FIG. 33. The simulation was performed for 40 generations, with the
cost factor for each generation being expressed as the difference
between the computed cost and a predetermined optimal cost (distance
from optimal solution).

For each generation, the program computed the minimum cost (for the
best placement in the generation), the average cost of all the
placements in the generation, the maximum cost (for the worst
placement in the generation) and the standard deviation of the costs.
The solution converged to produce a placement with a cost factor of
zero in the 40th generation. FIG. 35 illustrates the numerical
identifiers of the cells in the final placement with the zero cost
factor. FIG. 36 illustrates an example of the present method.

It will be understood that the present statistical selection method is
not limited to uniform crossover. The principles of the invention are
applicable to any type of crossover operation, such as one-point,
two-point and t-point crossover.

c. Greedy Crossover

FIGS. 37 to 43 illustrate "greedy" genetic alteration methods that yet
further increase the rate of convergence in the application of genetic
algorithms to integrated circuit chip placement and other
applications.

FIG. 37 illustrates a greedy crossover operation in which the worst
cell WC1 in a placement 286 is selected as having the highest
congestion. The cost for each cell is preferably computed using the
congestion based method described above with reference to FIGS. 25 to
28. However, the invention is not so limited, and any other method of
cost evaluation, such as the prior art half-perimeter approximation
method described above with reference to FIGS. 3 and 4 can be used.

In the method of FIG. 37, the worst cell WC1 in the placement 286 is
swapped with a cell X in a placement 288 which has the same location
in the placement 288 as the cell WC1 has in the placement 286. The
swap is preferably performed using the cell/location representation
method. More specifically, the location/cell list for the placement
286 is searched to determine the location of the cell WC1, and the
contents of that location are swapped for the contents (the cell X) of
the corresponding location in the placement 288.

The method of FIG. 37 is not limited to swapping a single pair of
cells. For example, a next worst cell WC2 in the placement 286 can be
swapped for a cell Y in the corresponding location in the placement
288. A single crossover operation can include swapping any number of
pairs of cells that can be selected using any suitable criterion. For
example, the best cell in the placement 286, rather than the worst
cell WC1, can be swapped to the placement 288. It is further within
the scope of the invention to swap adjacent pairs, triplets or longer
strings of cells, or blocks of cells.

d. Greedy Mutation

FIG. 38 illustrates a greedy mutation operation according to the
present invention. As with the greedy crossover operation of FIG. 37,
the worst cell WC in a placement 290 is selected in accordance with
the cost computation, and swapped with a cell X in a randomly selected
location in the placement 290.

FIG. 39 illustrates an alternative method in which a worst cell WC in
a placement 292 is swapped with an adjacent cell X. The geometric
relationship of the cell X to the cell WC can be selected randomly, or
in accordance with a predetermined criterion. For example, the cell X
can be the worst cell among the cells adjacent to the cell WC.

An extension of the method of FIG. 39 is illustrated in FIG. 40. In
this case, the worst cell WC1 in a placement 294 is swapped to the
location of a cell WC2 which has the lowest fitness among the cells
adjacent to the cell WC1. The cell WC2 is swapped to the location of a
cell WC3 which has the lowest fitness of the cells, other than the
cell WC1, that are adjacent to the initial location of the cell WC2.
The cell WC3 is swapped into the initial location of the cell WC1. It
will be understood that the operation of FIG. 40 results in a cyclical
transposition of the cells WC1, WC2 and WC3.

FIG. 41 illustrates another greedy mutation method according to the
present invention, in which a worst cell WC1 of a placement 296 is
swapped with the second worst cell WC2 in the placement 296. FIG. 42
illustrates a greedy mutation operation on a placement 298 which is an
extension of the method of FIG. 41 in which cells WC1, WC2, WC3 and
WC4 that are ranked in order of decreasing fitness are transposed
cyclically.

FIG. 43 illustrates yet another greedy mutation method of the
invention, in which a placement 300 includes a worst cell WC, and
additional cells X, Y and Z that are interconnected with the cell WC
in a net 302. In this method, the center of mass of the cells in the
net 302 is computed, and the worst cell WC is swapped with a cell CM
that is located at the center of mass of the net 302.

In all of the methods of FIGS. 37 to 41, cells can be selected in
accordance with an alternative cost criterion, such as highest fitness
(lowest cost), in which the best cell in the placement is selected to
be swapped.

7. Optimal Switching of Algorithms

The present method is capable of performing various placement
optimization (fitness improvement) algorithms and/or cost (fitness)
computation algorithms simultaneously, in combination, and/or
switching between algorithms in a manner that is predetermined to
optimize the processing efficiency. Such placement optimization or
fitness improvement algorithms include, but are not limited to,
simulated evolution, mutation, simulated annealing, constructive
placement, force directed placement and variants thereof.

An example of optimal switching between placement optimization or
fitness improvement algorithms in accordance with the present
invention is illustrated in FIGS. 44 and 45. FIG. 44 illustrates the
typical characteristics of two algorithms that are available for use
by the present physical design automation system 130 as illustrated in
FIG. 8, more specifically simulated evolution, and a variant of
simulated annealing known as "TimberWolf 3.2" as described in the
above referenced article to Sechen.

The horizontal axis in FIG. 44 represents the number of alterations
performed by the algorithms. In the case of simulated evolution, the
alterations are genetic crossover operations, whereas in the case of
simulated annealing the alterations are cell pair transpositions. The
vertical axis represents the fitness value of the cell placement
having the highest fitness in the population of cell placements.

The simulated evolution algorithm converges rather rapidly to a cost
value C1 after a number T1 of alterations, and
changes relatively slowly thereafter. The simulated annealing
algorithm requires many more alterations to reach the cost value C1,
as indicated at T2. However, the simulated annealing algorithm
converges more rapidly than the simulated evolution algorithm to cost
values below C1.

This information is utilized in accordance with the invention to
optimize the cell placement process by using the simulated evolution
algorithm to achieve rapid convergence during the initial phase of the
operation, and then switching to the simulated annealing algorithm to
increase the convergence rate during the final phase of the operation.
In the illustrated example, the optimization criterion for maximizing
the convergence rate is to switch from simulated evolution to
simulated annealing when the cost value of the cell placement having
the highest fitness in the population reaches the value C1.

A computer simulation utilizing the optimization criterion described
with reference to FIG. 44 is illustrated in FIG. 45. Three curves are
illustrated, representing the minimum, average and maximum fitnesses
of the cells in the population. The horizontal axis represents the
number of generations of genetic crossover, whereas the vertical axis
represents fitness.

The process was switched from simulated evolution to simulated
annealing after a number T3 of generations (genetic crossover
operations). It will be seen that the minimum and maximum fitness
values increase in a generally stepwise manner after the switchover at
T3, with the maximum fitness value, which corresponds to the highest
fitness cell placement, attaining a substantially higher value. In
this example, not only is the convergence rate increased by the
switchover, but a more fit cell placement is produced than could be
attainable using simulated evolution alone.

The optimization criterion for switching between various fitness
improvement algorithms can take a number of forms depending on a
particular application. Examples of such criteria include, but are not
limited to the following.

1. Switch when the cost value of the most fit placement reaches a
predetermined minimum value (the fitness reaches a predetermined
maximum value) as illustrated in FIG. 44.

2. Switch after a predetermined number of processing steps (genetic
crossover operations, simulated annealing cell transpositions, etc.)
have been performed.

3. Switch when a predetermined number of processing steps has been
performed without producing a change in the cost value of the most fit
placement.

4. Switch when a predetermined number of processing steps has been
performed without producing a change larger than a predetermined value
in the cost value of the most fit placement.

Although a preferred example of the invention in which a switch was
made at an optimal point in the processing operation from simulated
evolution to simulated annealing has been described and illustrated,
the invention is not so limited. Numerous other algorithms are
available that can be optimally switched in accordance with the
invention, including simulated evolution, mutation, simulated
annealing, constructive placement, force directed placement and
variants thereof.

8. Optimal Switching of Cost Functions

Fitness (cost) computation algorithms that can be utilized by the
invention include, but are not limited to, the congestion based cost
function as described above with reference to FIGS. 25 to 28, the
"half-perimeter" wire length computation as described with reference
to FIGS. 3 and 4, maximum pathlength, and combinations thereof
including the present composite cost function
CF«af c (P)+pf H XP)+Yf 1 (P)+ri.

In the latter case, the composite cost function can be used
exclusively, and switching performed by changing the values of the
constants a, Y y and g in a manner that is predetermined in accordance
with the particular application. However, such switching is not
limited to the present composite cost function, and can be applied to
switching between any cost functions having the same form and at least
one variable coefficient, such as between cost functions F1 and F2 as
expressed by

F1=A1f1(p)+B1f2(p) and

F2=A2f1(p)+B2f2(p)

where f1(p) is a first predetermined function of a placement, f2(p) is
a second predetermined function of a placement and A1, A2, B1 and B2
are predetermined constants.

FIGS. 46 to 49 illustrate a practical example of the invention in
which two fitness (cost) functions are optimally switched from one to
the other in accordance with an optimization criterion to maximize
convergence of the cell placements toward an optimal configuration.

FIG. 46 illustrates a simplified example for a 3x3 array of cells, in
which the horizontal axis represents cost values based on the prior
art "half-perimeter" wire length computation as described with
reference to FIGS. 3 and 4 and the vertical axis represents values
based on congestion as described with reference to FIGS. 25 to 28.
Each cross "+" represents a placement having a cost with a
corresponding numerical value on the respective axis.

FIG. 47 is a smoothed version of the information illustrated in FIG.
46 in which the numbers of placements are plotted versus the cost
values for the congestion and wirelength based cost functions
respectively. It will be seen for both cost functions that the numbers
of placements are low for extremely high and low cost values, and are
maximum for intermediate cost values.

Optimal switching between different cost function computation methods
prevents the optimization processing (simulated evolution or
annealing, etc.) from becoming trapped at local optima, and also
increases the rate of convergence toward a most fit or optimal
placement. As best seen in FIG. 47, the cost values are plotted as
decreasing (toward a more fit placement) from right to left. The
congestion based cost function increases to a maximum number of
placements at a cost value of approximately 25, and then decreases.
The wirelength based function peaks at a much higher number of
placements at a cost value of approximately 55.

The known behavior of different types of cost functions enables
optimal switching therebetween based on predetermined criteria. Based
on the information in FIGS. 46 and 47, the congestion based cost
function computation is preferably used during the initial portion of
the optimization processing, since the initial cost functions are
relatively high and the number of placements is low. The gradual
increase in the number of placements enables greater differentiation
between similar placements, thereby increasing the effectiveness of
the optimization processing and the rate of convergence toward an
optimal solution. The progressive change in the cost value resists
trapping of the optimization processing at a local optimum.

The two cost value curves intersect at a cost value of approximately
35. At this point, the congestion based cost value is continuing to
increase, whereas the wirelength based cost value is decreasing
sharply. In the illustrated
example, the cost value computation is preferably switched from
congestion based to wirelength based when the average cost value for
the placements in the population reaches approximately 35. This
prevents the processing from being trapped at a local optimum, and
causes rapid convergence to placements with low cost value in a
minimum period of time or number of processing steps.

Based on the known relative characteristics of different 
cost functions as exemplified in FIGS. 46 and 47, various 
optimization criteria can be utilized as switching points in 
actual processing environments. Examples of such criteria 
include the following. 

1. Switch when the cost value of the most fit placement
reaches a predetermined minimum value (the fitness reaches 
a predetermined maximum value). An exemplary cost value 
for the example of FIG. 47 would be approximately 30 to 40. 

2. Switch after a predetermined number of processing 
steps (genetic crossover operations, simulated annealing cell 
transpositions, etc.) have been performed. 

3. Switch when a predetermined number of processing 
steps has been performed without producing a change in the 
cost value of the most fit placement. 

4. Switch when a predetermined number of processing 
steps has been performed without producing a change larger 
than a predetermined value in the cost value of the most fit 
placement. 
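
The four switching criteria listed above can be sketched as a single predicate. This is only an illustrative sketch: the function name, the parameter names and the choice to keep a history of best cost values are assumptions made here, not details given in the text.

```python
def should_switch(best_costs, target_cost=None, max_steps=None,
                  stall_steps=None, min_change=0.0):
    """Return True when any enabled switching criterion is met.

    best_costs  -- cost of the most fit placement after each step so far
    target_cost -- criterion 1: switch once the best cost reaches this value
    max_steps   -- criterion 2: switch after this many processing steps
    stall_steps -- criteria 3 and 4: switch after this many steps in which
                   the best cost changed by no more than min_change
                   (min_change=0.0 corresponds to criterion 3)
    """
    steps = len(best_costs)
    if steps == 0:
        return False
    if target_cost is not None and best_costs[-1] <= target_cost:
        return True
    if max_steps is not None and steps >= max_steps:
        return True
    if stall_steps is not None and steps > stall_steps:
        window = best_costs[-(stall_steps + 1):]
        if max(window) - min(window) <= min_change:
            return True
    return False
```

With the example of FIG. 47, a call such as `should_switch(history, target_cost=35)` would trigger the change from the congestion based to the wirelength based cost function.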

FIG. 48 illustrates a computer simulation of optimal 
switching of cost value computation utilizing the relation- 
ship between congestion and wirelength based cost func- 
tions as illustrated in FIGS. 46 and 47. The example assumes 
a 10x10 array of cells, and a known optimal placement. The 
horizontal axis represents the number of steps of optimiza- 
tion processing, in this case generations of simulated evo- 
lution (genetic algorithm), whereas the vertical axis repre- 
sents fitness. Three curves are illustrated, representing 
fitness values for the placements in the population having 
minimum, average and maximum fitness respectively. 

The example of FIG. 48 is also tabulated numerically in 
FIGS. 49a to 49c. It will be seen that the maximum fitness 
increases rapidly for approximately the first 8 generations, 
and then tapers off to a more gradual slope. The cost value 
computation was initially performed using a congestion 
based cost function, and switched to the wirelength based 
cost function after the 18th generation. The fitnesses 
increase in almost a vertical step to approximately 55 to 57, 
and maintain these values during subsequent processing. 

The tabular listing of FIGS. 49a to 49c expresses the results as
cost values rather than as fitnesses. The
minimum, average and maximum cost are computed using 
the congestion based cost function for generations below 18 
and the wirelength based cost function for subsequent gen- 
erations. Other variables which are of interest and are not 
self-explanatory include: 

avg_distance — average distance or difference between 
placements, expressed as average number of cells in 
different cell locations over the population of place- 
ments. 

cheat— difference between average placement and prede- 
termined optimal placement, expressed as average 
number of cells over the population of placements that 
are in different locations from those in the optimal 
placement. 

ranko— cost value computed using the present composite 
cost function. 

yugo — cost value computed using the congestion based 
cost function as modified to require that a terminal of 
a net of one of the bounding boxes overlap or be within 
a predetermined distance of a switch box in order for
the congestion factor to be computed as the sum of the
overlapping bounding boxes.
wire — cost value computed using the wirelength cost 
function.
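
The two distance measures defined above can be sketched as follows. The sketch assumes that a placement is a list giving the cell held at each location and that the population average is taken over all pairs of placements; both are assumptions, and the function names are hypothetical. The scaling of the values printed in FIGS. 49a to 49c is not reproduced.

```python
from itertools import combinations


def placement_distance(a, b):
    """Number of locations that hold different cells in placements a and b."""
    return sum(1 for cell_a, cell_b in zip(a, b) if cell_a != cell_b)


def average_pairwise_distance(population):
    """Average distance between placements (assumed: over all pairs)."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(placement_distance(a, b) for a, b in pairs) / len(pairs)


def average_distance_to_optimal(population, optimal):
    """Average number of cells that are not in their optimal location."""
    total = sum(placement_distance(p, optimal) for p in population)
    return total / len(population)
```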

9. Simultaneous Placement and Routing (SPAR) 

The SPAR methodology is an attempt to place very large designs in a
short time, using a more accurate cost function, by the use of
multiple processors. Present placement programs use crude cost
functions, usually based on an estimate of wire length. In spite of
the use of these crude cost functions, the run times are often
measured in days.

The SPAR methodology divides the problem into subtasks and shares them
among a number of processors. This increase in processing power allows
the use of a more complex cost function while still significantly
reducing the elapsed time.
Operation 

The SPAR methodology alternates between a congestion calculation mode
and a placement improvement mode. In both modes one process assigns
tasks and collects data. This "host" process requires very little
computation and is able to support many "worker" processes. In
congestion calculation mode the assignments consist of lists of nets;
in placement improvement mode the assignments consist of lists of
cells.

Congestion Calculation 

In congestion calculation a routing is determined for each net
independently. It is assumed that the terminals of a cell exist at two
nodes in the two routing channels on each side of the cell. First a
minimum spanning tree connecting the nodes of the net is generated.
The edges of this minimum spanning tree which cross cell columns are
then used to generate column feedthroughs. Each edge crossing a column
generates two feedthroughs, one at the Y coordinate of each node of
the edge.
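
The spanning tree and feedthrough generation just described can be sketched as follows. This is a minimal sketch under stated assumptions: nodes are (x, y) points, distance is Manhattan distance, the tree is built with Prim's algorithm (the text does not name an algorithm), and cell columns are represented by hypothetical x coordinates lying between the routing channels.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def minimum_spanning_tree(nodes):
    """Prim's algorithm; returns edges as (i, j) index pairs into nodes."""
    if not nodes:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < len(nodes):
        best = None
        for i in in_tree:
            for j in range(len(nodes)):
                if j in in_tree:
                    continue
                d = manhattan(nodes[i], nodes[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        in_tree.add(j)
        edges.append((i, j))
    return edges


def column_feedthroughs(nodes, edges, column_xs):
    """Two feedthroughs per crossed column, one at each end node's Y."""
    feedthroughs = []
    for i, j in edges:
        (x1, y1), (x2, y2) = nodes[i], nodes[j]
        low, high = min(x1, x2), max(x1, x2)
        for cx in column_xs:
            if low < cx < high:          # the edge crosses this column
                feedthroughs.append((cx, y1))
                feedthroughs.append((cx, y2))
    return feedthroughs
```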

The global edges of the minimum spanning tree have now served their
purpose and are then discarded. Channel edges are generated in those
channels which contain cell terminal nodes. An edge starts at the top
terminal or feedthrough of the channel and connects to the next
terminal or feedthrough. This continues until the bottom terminal or
feedthrough in the channel is reached.

The entire set of channel edges is sorted in decreasing order of cost
(length weighted by previous congestion data). The set of channel
edges and feedthroughs is then reduced by removing, beginning with the
highest cost edges, those edges which can be removed while still
leaving all terminals connected.

The chip is divided into global grids (channel segments ten wiring
grids long) and the congestion cost for each global grid is
incremented by one for each net which is routed into or through it.
Because this data is computed independently for each net, it can be
reported to and summed by the host to obtain a congestion map of the
placement. On the other hand, because the edge costs are based on the
previous congestion run, as the placement approaches a stable state
the routing also approaches a true global routing.
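
The per-net congestion data and its summation by the host can be sketched as below. The representation of a global grid as a hashable identifier and the function names are assumptions made for illustration.

```python
from collections import Counter


def net_congestion(global_grids_used):
    """One net's contribution: one count for each global grid it is routed
    into or through, however many of its edges touch that grid."""
    return Counter(set(global_grids_used))


def congestion_map(per_net_grids):
    """Host-side summation of the independently computed per-net data."""
    total = Counter()
    for grids in per_net_grids:
        total.update(net_congestion(grids))
    return total
```

Because each worker result is an independent counter, the host can add results in whatever order they arrive.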


Placement Improvement 

After the host has received congestion data for all nets for the
current placement, it evaluates the cost contribution of the placement
of each cell. The cells having the highest cost contributions are then
selected for placement improvement. The tasks of improving the
placement of these cells are then assigned to "worker" processes.
The cost function is then evaluated for the cell in each location
within a window around its current location. The use of a window
limits calculation; a cell that needs to move beyond the limits of the
window will move to the edge of the window and then be selected for
improvement again in a later pass. Since placement improvement is done
on a cell by cell basis, all relevant cost function calculations must
refer to a single cell.
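
The window search can be sketched as follows, assuming a rectangular grid of locations, a square window given by a half width, and a caller-supplied cost callback; these names and shapes are illustrative assumptions only.

```python
def best_location_in_window(current, half_width, cost_at, n_cols, n_rows):
    """Return the lowest cost location within the window around `current`.

    current    -- (x, y) location of the cell being improved
    half_width -- window extends this many locations in each direction
    cost_at    -- callable giving the cell's cost at a candidate location
    """
    cx, cy = current
    best, best_cost = current, cost_at(current)
    x_low, x_high = max(0, cx - half_width), min(n_cols - 1, cx + half_width)
    y_low, y_high = max(0, cy - half_width), min(n_rows - 1, cy + half_width)
    for x in range(x_low, x_high + 1):
        for y in range(y_low, y_high + 1):
            cost = cost_at((x, y))
            if cost < best_cost:
                best, best_cost = (x, y), cost
    return best
```

A cell whose best location lies outside the window stops at the window edge and reaches its destination over several passes.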

Since the routing mode is relatively expensive, it may be 
desirable to do several passes of placement improvement 
between routing mode passes. After each placement 
improvement pass the accumulated error in the congestion 
calculation could be estimated. If this error exceeds some 
limit, a new congestion calculation routing pass should be 
executed. 
Cost Function 

The cost function used in SPAR preferably consists of the 
following terms: 

1) wire length 

2) cell column variance 

3) cell overlap 

4) routing congestion 

The routing congestion term is the key to a quality layout 
and is the most difficult to obtain. 

Wire Length 

The wire length component of the cost function assigned 
to a cell in placement improvement is determined as follows. 
The bounding box for the terminals of the net excluding the 
current cell is determined. The Manhattan distance that the
current cell lies outside this bounding box is the wire length 
charged to the cell. A possibility is to modify this number by 
the ratio of the size of the current net bounding box to the 
size of the lower bound on that box. The size of the lower 
bound on the box would be computed by the sum of the areas 
of the cells on the net plus their associated channel area. 
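
The basic wire length charge can be sketched as below, assuming that each terminal is reduced to a single (x, y) point. The function name and data layout are illustrative assumptions, and the optional bounding box ratio adjustment is not modelled.

```python
def wire_length_charge(cell_xy, other_terminals):
    """Manhattan distance by which cell_xy lies outside the bounding box
    of the net's other terminals (0 if the cell lies inside the box)."""
    xs = [x for x, _ in other_terminals]
    ys = [y for _, y in other_terminals]
    x, y = cell_xy
    dx = max(min(xs) - x, 0, x - max(xs))   # horizontal distance outside the box
    dy = max(min(ys) - y, 0, y - max(ys))   # vertical distance outside the box
    return dx + dy


others = [(0, 0), (4, 3), (2, 5)]           # terminals of the net, current cell excluded
charge_right = wire_length_charge((10, 2), others)
charge_inside = wire_length_charge((2, 2), others)
charge_corner = wire_length_charge((-1, 7), others)
```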

Column Length Variance 

When a cell is being evaluated for a location in placement 
improvement, the amount that the addition of the cell to that 
column will make the column longer than the column
average must be charged to the cell. It may be desirable to 
increase this cost nonlinearly as the amount that the channel 
exceeds the average increases. 
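
One possible form of this charge is sketched below. The power-law penalty, the `exponent` parameter and the choice to take the average before the cell is added are assumptions made for illustration; the text only states that a nonlinear increase may be desirable.

```python
def column_length_charge(column_lengths, col, cell_height, exponent=2.0):
    """Charge for the length by which column `col` would exceed the
    average column length if a cell of `cell_height` were added to it."""
    average = sum(column_lengths) / len(column_lengths)
    excess = column_lengths[col] + cell_height - average
    return max(excess, 0.0) ** exponent     # nonlinear growth with the excess


lengths = [10, 12, 8]
charge_long = column_length_charge(lengths, col=1, cell_height=3)
charge_short = column_length_charge(lengths, col=2, cell_height=1)
```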

Cell Overlap 

If the location of a cell in placement improvement causes 
the outline of the cell to overlap the outline of another cell, 
the amount of that overlap is charged to the cell. 
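
A minimal sketch of the overlap charge follows, assuming axis-aligned rectangular cell outlines and taking the "amount" of overlap to be the overlapping area. Both are assumptions; the names are illustrative.

```python
def overlap_area(a, b):
    """Overlapping area of two rectangles given as (x_min, y_min, x_max, y_max)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)            # zero when the outlines do not intersect


def overlap_charge(cell, others):
    """Total overlap charged to `cell` against every other cell outline."""
    return sum(overlap_area(cell, o) for o in others)


charge = overlap_charge((0, 0, 4, 2), [(3, 1, 6, 5), (10, 10, 12, 12)])
```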

Congestion 

The congestion cost of a net passing through each global 
grid of the channel is determined in the routing phase. 
Therefore the net congestion cost could be calculated by 
summing the cost of the global grids through which the net 
passes. This calculation is complicated by the global con- 
gestion and the net routing not being available at the same 
time in the same processor. Another problem is how to 
assign this net congestion cost to individual cells. One 
approach is to divide the net cost equally between the cells 
on the net. 
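
The equal-division approach can be sketched as follows, assuming the congestion map and the net routes have already been gathered in one place. The data layout and names are hypothetical.

```python
def net_congestion_cost(route_grids, grid_cost):
    """Sum the congestion cost of every global grid the net passes through."""
    return sum(grid_cost[g] for g in route_grids)


def per_cell_congestion(nets, grid_cost):
    """nets maps a net name to (cells on the net, global grids on its route).
    Each net's cost is divided equally between the cells on that net."""
    charge = {}
    for cells, route in nets.values():
        share = net_congestion_cost(route, grid_cost) / len(cells)
        for c in cells:
            charge[c] = charge.get(c, 0.0) + share
    return charge


grid_cost = {(0, 0): 3, (0, 1): 1, (1, 1): 2}
nets = {
    "n1": (["a", "b"], [(0, 0), (0, 1)]),
    "n2": (["b", "c", "d"], [(0, 1), (1, 1)]),
}
charge = per_cell_congestion(nets, grid_cost)
```

Cell b lies on both nets and therefore accumulates a share from each.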

Interactions and complications 

Since multiple processors are modifying the placement
simultaneously, interactions between cell placements cause
potential problems. The first interaction is the relationship
between cells in calculating the column length and cell
overlap. If a single cell attempts to move it will likely not
find a suitable open slot and the resulting channel length and
overlap costs will prevent its movement.

A solution is to process the cells in a placement improvement
task as a batch. If the space occupied by the one
thousand plus cells in a batch is freed before the search for
new locations begins, the cells can effectively swap position
with no problem. For this reason it is desirable to assign cells
from the same region of the chip to the same task group;
although, if this is done, care should be taken to ensure that
the regions vary from pass to pass.

Another interaction is that of two (or more) cells on the
same net but in different placement improvement groups
moving toward each other and in the process passing each
other. This is a concern; however, in most cases the result
will still be a reduction in the net bounding box.
Furthermore, if cells in the same region are grouped into the
same task group as suggested in the previous paragraph, the
cost calculation will no longer be independent. The new
location of the first cell improved will be used in the
calculation of the new location of the second cell.

The process decomposition and recomposition methodology
is generally illustrated in the flowchart of FIG. 50.
The first step of the method is to generate an initial
placement using a floorplanning, partitioning or other placement
algorithm. The initial placement can be generated
using a hierarchical structure specified by the designer, or
such a structure can be discovered using a partitioner. In any
case, cells that are connected to each other are grouped
together, and the groups are roughly distributed on the chip
area in accordance with their functional, connective or other
associative relationships.

After the initial placement has been generated, a global
routing is performed, preferably using an algorithm that can be
decomposed into tasks which can be performed simultaneously
in parallel. It is further within the scope of the
invention to perform initial placement and routing by dividing
the cell into contiguous non-overlapping areas, and
using parallel processors to perform placement and/or routing
in the areas individually. A global placement and/or
routing can then be recomposed from the results of the local
operations. The initial placement and routing can also be
performed simultaneously on nets or other groupings of
nets.

The global routing provides a detailed mapping of the cell
interconnects for the placement, and enables accurate computation
of cell interconnect congestion. A fitness or cost
value is computed for each cell in the placement.

The fitness (cost) computation algorithms that can be 
utilized by the invention include, but are not limited to, the 
congestion based cost function as described above with 
reference to FIGS. 25 to 28, the "half-perimeter" wire length
computation as described with reference to FIGS. 3 and 4,
maximum pathlength, and combinations thereof including 
the present composite cost function CF-af^PJ+pf^fl^+Yfj 
(!■)+&>• 

The individual cost values are utilized to identify the most
congested areas of the placement. Numerous standard statistical
methods can be utilized to provide smoothed values
of localized congestion, such as taking averages of cost
values over individual areas of the placement.

In addition, the cost values for individual cells can be
modified in accordance with the computed local congestion
as illustrated in FIG. 51 to sharpen the contrast between
congested and uncongested areas. A cell X is connected to
a net 304 that includes additional cells A, B, C and D that are
surrounded by a perimeter as indicated in broken line at 306.
An area 308 of high congestion is located between the cell
X and the cells within the perimeter 306 such that a wire 310
connecting the cell X to the other cells A, B, C and D in the
net 304 passes through the congested area 308.

The cost value for the cell X can be modified in several
ways in accordance with the invention to reflect its relationship
to the congested area 308. For example, the cost value
can be increased by a predetermined function of a length S
of the wire 310 between the cell X and the perimeter 306.
Alternatively, the cost value can be increased by a predetermined
value for each congested area that the wire 310
passes through, or by an amount proportional to the size of
each congested area.

A congestion reduction algorithm is then applied to the
congested areas simultaneously using parallel processors.
The algorithm selects cells to be moved based on their
individual cost values and proximity to congested areas, and
can be considered as comprising "suggestion generators" for
suggesting improvements to the placement. For example, the
cell in each net that has the highest congestion based cost
value can be selected for relocation.

A major objective in improving the placement is to
reroute the wiring that passes through congested areas so
that it does not pass through these areas. This can be
accomplished in accordance with the invention by relocating
cells in a number of ways.

For example, as illustrated in FIG. 52, the cell X that was
initially located outside the perimeter 306 can be relocated
inside the perimeter 306. A preferable method of selecting
the new location for the cell X is to compute it as being the
centroid (center of gravity, mass, area, etc.) of the area
enclosed by the perimeter 306 as described above with
reference to FIG. 43.

Alternatively, as illustrated in FIG. 53, the cell X can be
tentatively relocated to a plurality of locations in proximity
to its initial location, and the net 304 rerouted for each of the
new proposed locations. A suitable location is one in which
none of the wiring of the net passes through the congested
area 308 or any other congested areas. The new location can
be inside or outside the perimeter 306.

It is further within the scope of the invention to perform
the congestion reduction operations using alternative
algorithms, such as simulated evolution or annealing, or
variants thereof.

The computation can be terminated when one such location
is identified, or can be optimized by computing the
routing for the net 304 that avoids all congested areas and
further has minimum total wirelength or other cost parameter.
The net routings are preferably performed using a
Steiner tree or other suitable algorithm. The present parallel
processing methodology enables the rerouting and evaluation
for a plurality or all of the proposed new cell locations
to be performed simultaneously.

The placement is updated by relocating the selected cells
to their new locations. It will be noted that although the cells
for relocation can be selected globally as having the highest
cost values in the entire placement, it is further within the
scope of the invention to divide the cell into contiguous
non-overlapping areas, and apply the congestion reduction
algorithm to a plurality or all of the areas simultaneously
using parallel processors.

The overall fitness of the placement is then evaluated to
determine if a predetermined fitness criterion has been
attained. If so, this particular phase of the placement optimization
is completed.

Preferably, the method illustrated in FIG. 50 is performed
on a plurality of initial placements simultaneously in accordance
with process decomposition methodology of the
invention. The altered placements are then evaluated on the
basis of fitness. The most fit placement can be selected and
all other placements discarded, with the single selected
placement being retained to produce the finished integrated
chip design. Alternatively, several of the most fit placements
can be subjected to further placement optimization processing
operations such as genetic crossover.

If the fitness of the placement has not attained a predetermined
value after performing the congestion reduction
operations, the steps of identifying the most congested areas
and improving the placement as described above are
repeated. This procedure can be repeated any number of
times until the fitness has been improved to a sufficient
extent.

As discussed above, global routing is very time
consuming, and it is desirable to perform it only when
absolutely necessary. This is made possible in accordance
with the method of the present invention, while still improving
the fitness of a placement in a progressive manner.

Each cell that is relocated without performing global
rerouting creates an error in the initial global routing that
was processed to obtain the placement congestion information.
A corresponding error is therefore created in the
congestion mapping. A certain amount of error can be
tolerated, as long as the error is not compounded to such an
extent that the accuracy of the congestion mapping is
unacceptably degraded and the congestion reduction operations
do not produce effective results, and/or the system
begins to exhibit oscillatory behavior.

The error can be acceptably managed by estimating the
effect of the error on the congestion mapping, and performing
a new global routing when the error is determined to
have exceeded a predetermined level. This enables a number
of iterations of congestion reduction and evaluation to be
performed before a global rerouting is necessary, thereby
substantially reducing the time required for performing the
processing.

The cumulative effect of the error will differ in accordance
with each particular application, and is preferably evaluated
and estimated empirically. In a case in which the chip is
divided into contiguous areas that are subjected to simultaneous
congestion reduction processing, a separate error
estimate can be computed for each area, and a global
rerouting performed when any of the estimates exceeds a
predetermined value.

10. Moving Windows

As discussed above, the time required to perform a fitness
calculation increases with the size of the cell placement, and
the number of fitness calculations required per generation
increases with the size of the population. The number of
generations required to reach a solution increases with the
size of the population.

Thus, the computation time increases rapidly with problem
size. Taking the memory requirements and computation
time together, the computational requirements increase very
rapidly with problem size.

This problem is overcome in accordance with the present
invention by decomposing various aspects of the physical
design problem into tasks that can be performed simultaneously
using parallel processors in the manner described
above with reference to FIGS. 9 to 12. A particularly
advantageous application of the present method is to divide
a placement into a plurality of areas or "windows" that
constitute subsets of cells of the complete placement, and
process these areas simultaneously using parallel processors.
For example, different areas of the placement can be optimized
simultaneously using simulated evolution or
annealing, etc.

Optimal placement of the cells within each window
depends on having the "correct" set of cells assigned to each
window. In addition, some of the cells will have connections
to cells outside their own window that will affect the placement
of these cells within the window.

While a constructive placement or other algorithm can be
used to provide a good partitioning of cells into windows
and a good initial placement, it will not be perfect. Mechanisms
must be provided to iterate toward the solution while
exchanging cells and updated cell placement information
between the windows.

A first method of accomplishing this goal is illustrated in
FIGS. 54 to 56. An exemplary placement is shown that
consists of a matrix of 169 cell locations arranged in a 13x13
matrix. A method of practicing the invention using a single
processor and a single window is illustrated in FIG. 54,
whereas a multiple window and multiprocessor version of
the method is illustrated in FIGS. 55 and 56.

As illustrated in FIG. 54, a window A is defined as being
constituted by an area equal to a 4x4 matrix of cell locations
of the placement. Although the placement and the window
are each illustrated as having equal numbers of rows and
columns, the invention is not so limited, and it is only
necessary that the window be smaller than the placement.

The window A is successively moved to locations on the
placement as indicated at A1 . . . A16, and a fitness
improvement operation such as simulated evolution,
mutation, simulated annealing, etc. is performed on the cells
delineated by the window at each location. For example, in
the location A1, a subset of cells of the placement includes
the cells 1,1 . . . 4,4. The locations to which the window A
is moved are selected such that each cell of the placement is
delineated by the window A and the improvement operation
is performed on each cell at least once.

In addition, the window locations are selected so that
there will be overlap between the subsets of cells delineated
by the window A in adjacent locations. The length of an edge
of the placement is designated as N=13, whereas the length
of an edge of the window A is designated as M=4, where
M<N.

The window A is moved from the initial location A1
rightwardly by three cell locations (M-1), or one location
less than the edge length M of the window. From the location
A4, the window A is moved downwardly by M-1=3 cell
locations, and leftwardly to the left edge of the placement.
The window A is then moved rightwardly in increments of
M-1=3 as before. This pattern is repeated until the window
A has been moved to the location A16.

The pattern described and illustrated with reference to
FIG. 54 is a rectilinear raster type scan pattern. However, the
invention is not so limited, and the window A can be moved
in any desired manner as long as each cell is subjected to the
fitness improvement operation at least once and there is
overlap between adjacent window locations and the subsets
of cells delineated thereby respectively.

In the illustrated example, there is an overlap of one row
or column of cell locations between each adjacent window
location A1 to A16. For example, in the window location A2,
the subset of cells that is delineated consists of 1,4 . . . 4,7.
The window locations A1 and A2 overlap in the fourth
column of the placement, such that the cell locations 1,4,
2,4, 3,4 and 4,4 are common to or overlapped in the window
locations A1 and A2.

The overlap is greater for the window locations in the
central portions of the placement. Taking, for example, the
window location A6, all but the four cells 5,5, 5,6, 6,5 and
6,6 in the central portion of the window location A6 overlap
the adjacent window locations in the opposite respective
directions by one row or column position.

During the optimization fitness processing of the subset of
cells delineated by the window A in each respective location,
cells which are misplaced or "do not belong" in that window
location will move to the outside edges of the window A. As
the window A is incrementally stepped or "marches" across
the placement, the misplaced cells will be in the overlapping
rows and columns and will move across the placement to
their optimal locations. After successive iterations, the location
of each cell approaches its optimum and the cells to
which it is connected in the net can then be placed in their
optimum locations.

The single window and processor variation of the invention
illustrated in FIG. 54 can be advantageously employed
in applications in which only one processor is available, and
it does not have sufficient capacity to process an entire cell
placement. However, the method is preferably implemented
using a plurality of processors, each operating on a subset of
cells in a respective window.

As illustrated in FIG. 55, four windows A, B, C and D are
moved to initial locations designated as A1, B1, C1 and D1,
and the respective subsets of cells delineated by the windows
A, B, C and D in these locations are simultaneously
processed for optimization of the cell placement using any
of the algorithms or methods described above.

The windows A, B, C and D are then moved rightwardly
by three cell locations to window locations A2, B2,
C2 and D2 as illustrated in FIG. 56, and the subsets of cells
delineated by the windows A, B, C and D in these positions
are processed. Although not specifically shown, the windows
A, B, C and D are then moved downwardly by three
cell locations, to the left by three cell locations, and then
upwardly by three cell locations to cover the entire placement.

Since M=4, N=13, and each window movement is M-1=3,
the subsets of cells delineated by the windows A, B, C and
D overlap in the manner described above with reference to
FIG. 54. However, the advantages of using multiple windows
(as many as hundreds or thousands of windows and respective
processors can be employed in a practical application)
include an increase in processing speed and a reduction of
processor capacity in proportion to the number of windows.

Although the windows A, B, C and D in the examples of
FIGS. 55 and 56 are moved across the placement in a
predetermined raster pattern, it is further within the scope of
the invention to move one or more non-overlapping windows
to locations that are selected in accordance with a
predetermined criterion.

In a preferred form of the invention, the interconnect
congestion of a cell placement is measured using the congestion
based cost function method described above, by
performing a global routing or by other means. The areas of
highest congestion are identified, and windows are moved
over the congested areas. The congested areas are then
processed preferentially.

This provides localized optimization of problem areas or
"hot spots" of a placement on a priority basis, and can
substantially accelerate convergence to the optimal placement.
The two methods are not mutually exclusive, and can
be preferably used alternately. For example, the raster
method described with reference to FIGS. 55 and 56 can be
performed first, and then the congested areas can be identified
and windows moved over them for subsequent preferential
optimization processing.
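
The overlapping raster scan of the first method (a 13x13 placement, a 4x4 window and a step of M-1) can be sketched as follows. The function names are illustrative, cell locations are 1-based (row, column) pairs, and the sketch assumes that N-M is a multiple of M-1 so that the last window ends exactly at the edge of the placement.

```python
def raster_window_origins(n, m):
    """Top-left (row, col) of each window location, in raster order,
    for an n x n placement scanned by an m x m window stepped by m - 1."""
    step = m - 1
    starts = list(range(1, n - m + 2, step))
    return [(r, c) for r in starts for c in starts]


def window_cells(origin, m):
    """Set of cell locations delineated by an m x m window at `origin`."""
    r0, c0 = origin
    return {(r, c) for r in range(r0, r0 + m) for c in range(c0, c0 + m)}


origins = raster_window_origins(13, 4)       # locations A1 ... A16
a1 = window_cells(origins[0], 4)
a2 = window_cells(origins[1], 4)
shared = sorted(a1 & a2)                     # cells common to A1 and A2
covered = set().union(*(window_cells(o, 4) for o in origins))
```

For these values the sketch yields 16 window locations, every cell location is covered at least once, and A1 and A2 share the fourth column.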

A second method of cell placement improvement using
moving windows in accordance with the present invention is
illustrated in FIGS. 57 to 59. In this case, the placement is
first processed using contiguous window locations A1 to
A16 as illustrated in FIG. 57. This can be performed using
a single moving window, or using two or more windows and
respective processors. Since there is no overlap between
adjacent window locations, the entire processing can be
accomplished simultaneously using 16 processors.

The edge length M=4 of the window in FIG. 57 is not an
integral fraction of the edge length N=13 of the placement.
For this reason, the window locations A4, A8 and A12 to
A16 on the right and lower edges of the placement consist
of fewer than 16 cell locations. However, the invention is not
so limited, and it is within the scope of the invention to have
each window location consist of 16 cell locations. For
example, although not specifically illustrated, the window
location A4 could consist of the 16 cell locations 1,10 . . .
4,13, rather than the four cell locations 1,13, 2,13, 3,13 and
4,13 as shown.

The size of the window used in the example of FIG. 57 is
M=4, as in the example of FIGS. 55 to 56. In the second step
of the method of FIGS. 57 to 59, the placement is again
processed using a window having a different size. As shown
in FIG. 58, a window having edge length of L=3 (an area of
9 cell locations) is employed for processing the placement.

The window locations, designated as E1 to E25 in FIG.
58, are contiguous, with the window locations at the left and
lower portions of the placement consisting of fewer than 9 cell
locations. However, a different arrangement in which all
window locations consist of 9 cell locations can be provided
as described above with reference to FIG. 57.

The exchange of cells between windows and movement
of cells to their optimal locations is accomplished by overlap
between the different sized windows as illustrated in FIG.
59. In the example illustrated in FIG. 59, the
window location E2 overlaps the window location A1 for
cell locations 1,4, 2,4 and 3,4. The window location E6
overlaps the window location A1 for cell locations 4,1, 4,2
and 4,3.

For the window locations in the central portions of the
placement, each window location of the M=4 window of
FIG. 57 will be overlapped by window locations of the L=3
window of FIG. 58 as described above with reference to
FIG. 54.

The values of M, L and N can be varied over a wide range
in accordance with the present invention. As a general rule,
L should not be an integral fraction of M, as this would not
enable overlap between the two sets of windows. However,
even this limitation can be overcome by offsetting the two
sets of window locations such that their edges do not
coincide.
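
The two contiguous tilings of the second method can be sketched as below, assuming that both tilings start at cell location 1,1 and that windows at the far edges are clipped to the placement. The function name `tile` is illustrative, and locations are 1-based (row, column) pairs.

```python
def tile(n, size):
    """Contiguous size x size window locations over an n x n placement,
    in raster order; windows on the far edges are clipped."""
    starts = range(1, n + 1, size)
    return [
        {(r, c)
         for r in range(r0, min(r0 + size, n + 1))
         for c in range(c0, min(c0 + size, n + 1))}
        for r0 in starts
        for c0 in starts
    ]


a = tile(13, 4)      # window locations A1 ... A16 (M = 4)
e = tile(13, 3)      # window locations E1 ... E25 (L = 3)
e2_in_a1 = sorted(e[1] & a[0])   # overlap of E2 with A1
e6_in_a1 = sorted(e[5] & a[0])   # overlap of E6 with A1
```

Under these assumptions the sketch gives the overlaps of E2 and E6 with A1 that are stated above, and the clipped location A4 contains four cell locations.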

A third method of cell placement optimization or
improvement processing is illustrated in FIGS. 60 and 61.
Although only a single moving window is shown for simplicity
of illustration, the invention is not so limited, and the
method can be and is preferably practiced using a plurality
of windows for simultaneously processing respective subsets
of cells of the placement that are delineated thereby.

As illustrated in FIG. 60, a window A is moved in a raster
pattern or in accordance with a prioritization based on
interconnect congestion or the like to different non-overlapping
locations on the placement. The size of the
window A is M=4, such that the area of the window A is 16
cell locations as in the examples above. Optimization or
improvement processing is performed on the subset of cells
delineated by the window A, in the illustrated example on
the cells in locations 5,5 . . . 8,8.

Another window A' is defined that circumscribes and
moves integrally with the window A. The window A' has an
edge length of P=M+2 cell locations, and an area of 6x6=36
cell locations.

In accordance with the invention, a border area A-A' is 
defined outside the periphery of the window A and inside the 
periphery of the window A' that consists of 20 cell locations 
4,4, 4,5, 4,6, 4,7, 4,8, 4,9, 5,4, 6,4, 7,4, 8,4, 5,9, 6,9, 7,9, 8,9, 
9,4, 9,5, 9,6, 9,7, 9,8 and 9,9. The optimization or improve- 
ment computing or processing means is adapted to process 
the cells delineated by the window A within the area of the 
window A'. In other words, the optimization is performed 
using a larger area than the subset of cells delineated by the 
window A originally occupied. The processing window can 
be considered as being "expanded". 
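
The border area can be computed as the difference between the two windows, as in the sketch below. The function names are illustrative, and locations are 1-based (row, column) pairs.

```python
def square(r0, c0, edge):
    """Cell locations of an edge x edge window whose top-left cell is (r0, c0)."""
    return {(r, c) for r in range(r0, r0 + edge) for c in range(c0, c0 + edge)}


def border_area(r0, c0, m):
    """Locations inside the circumscribing window A' (edge m + 2) but
    outside the window A (edge m) whose top-left cell is (r0, c0)."""
    inner = square(r0, c0, m)
    outer = square(r0 - 1, c0 - 1, m + 2)
    return outer - inner


border = border_area(5, 5, 4)    # window A covers 5,5 ... 8,8
```

For the illustrated window the border contains 36 - 16 = 20 locations, running from 4,4 to 9,9.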

During processing, the cells having the worst placement 
in the window A are moved into the border area A-A'. These 
cells can be considered as misplaced or "garbage" cells, and 
have optimal locations somewhere in the placement outside 
the windows A and A'.

These cells are put on a misplaced cell list or "garbage 
list" as illustrated in FIG. 61. Equating, for example, cell 
designations and cell locations, the garbage list for the 
arrangement illustrated in FIG. 60 consists of the cells 4,4, 
4,5, 4,6, 4,7, 4,8, 4,9, 5,4, 6,4, 7,4, 8,4, 5,9, 6,9, 7,9, 8,9, 9,4, 
9,5, 9,6, 9,7, 9,8 and 9,9. 

After a subset of cells delineated by a window is 
processed, or alternatively after the entire placement is 
processed, an attempt is made to relocate the cells in the 
garbage list to acceptable new locations in the placement. 

A preferred method of selecting a new location for a cell 
on the garbage list is to compute the location of the centroid 
of the net to which the cell is connected. For the purposes of 
the invention, the term "centroid" is defined as a general 
term that can alternatively specify center of mass, center of 
gravity, center of force, etc. An example of computing the 
center of gravity of a net was described above with reference 
to FIG. 43. 

If the calculated new location does not already have a cell 
in it, the garbage cell is moved to the new location and 
included in subsequent optimization processing. If the cal- 
culated location is not vacant, the placement attempt fails, 
and the garbage cell remains on the garbage list and is not 
included in subsequent optimization processing. 
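
The basic relocation attempt can be sketched as follows. The sketch assumes that the centroid is the arithmetic mean of the locations of the other cells on the net, rounded to the nearest cell location; the rounding, the names and the data layout are illustrative assumptions.

```python
def centroid(points):
    """Arithmetic mean of a list of (row, col) locations."""
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def relocate_garbage(garbage, net_of, occupied):
    """Try to move each garbage cell to the centroid of its net.
    garbage:  list of cell names on the garbage list
    net_of:   cell name -> locations of the other cells on its net
    occupied: set of occupied locations (updated in place)"""
    placed, remaining = {}, []
    for cell in garbage:
        r, c = centroid(net_of[cell])
        target = (round(r), round(c))
        if target not in occupied:
            occupied.add(target)        # vacant: the cell is moved there
            placed[cell] = target
        else:
            remaining.append(cell)      # attempt fails: cell stays on the list
    return placed, remaining


occupied = {(2, 2)}
net_of = {"g1": [(1, 1), (3, 3)], "g2": [(4, 4), (6, 8)]}
placed, remaining = relocate_garbage(["g1", "g2"], net_of, occupied)
```

Here the centroid for g1 is already occupied, so g1 remains on the garbage list, whereas g2 is moved to its vacant centroid location.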

A modified method of attempting to relocate a cell from 
the garbage list to the placement is to determine if a window 
delineating the calculated cell location has any vacant cell 
locations, and if so, moving the garbage cell into the most 
suitable vacant cell location. If cells remain on the garbage 
list after the entire placement has been processed, an alter- 
native method for placing these cells can be employed, such 
as using a Steiner tree or detailed routing algorithm and 
feeding back the results to the placement process. 

Whether or not an attempt to move a cell from the garbage 
list to a new location in the placement is successful, the 
calculated new location of the cell is used in calculating new 
locations for other cells on the garbage list. The location data 
can be updated after each attempt at cell relocation, after 
each subset of cells in a respective window is processed, 
after the entire placement is processed, or at any other
suitable interval.

11. Chaotic Placement

FIG. 62 illustrates another method of cell placement
optimization in accordance with the present invention.
Although the method can be practiced serially using a single
processor, it is preferably performed using a plurality of
parallel processors in accordance with the process decomposition
described above with reference to FIGS. 6 to 8 and
the fail-safe headware methodology described with reference
to FIGS. 9 to 12.

As illustrated in FIG. 62, a net 312 that constitutes a
subset of a placement of cells includes a cell X in a current
or initial location 314, and cells A to F that are interconnected
with the cell X in the net 312. It will be noted that a
netlist for the placement includes all of the nets and the cells
that are interconnected thereby respectively. With the cell X
in the initial location 314, the cell interconnect congestion
and fitness of the placement are assumed to be less than
optimal.

The fitness of the placement is improved in accordance
with the present method by relocating at least some of the
cells to more suitable locations. This is done for each cell
that is to be relocated by computing a location 316 of a
centroid CG of the other cells in the net 312 and any other
nets to which the cell X is connected. For the purposes of the
invention the term centroid is defined as a general term
that can alternatively specify center of gravity, mass, force,
area, etc.

It is within the scope of the invention to move the cell X
directly to the centroid location 316. However, the effectiveness
of the method is enhanced by introducing a variable
parameter lambda λ, and multiplying the distance S between
the initial location 314 and the centroid location 316 by λ.

The cell X is then moved from the initial position 314
toward the centroid location 316 by a distance λS such that
the distance of movement is proportional to the distance S.
If λ=1, the cell X will be moved exactly to the centroid
location 316. If λ is less than unity, the cell X will be moved
to a location 318 between the initial location 314 and the
centroid location 316 as indicated at CG1. If λ is greater than
unity, the cell X will be moved beyond the centroid location
316 to a location 320 as indicated at CG2.

The value of λ is selected such that the cell relocation
operations will cause the placement to converge toward an
optimal configuration with maximum effectiveness. The
factor λ is characterized as a "chaos" factor because as its
value is increased, the placement optimization progressively
diverges. A certain amount of chaos is necessary to prevent
entrapment of the process at local fitness optima. However,
if the chaos factor λ is too high, the process will diverge into

sition function λS by which the move to the next location is
computed. The CA model is executed for a series of iterations.

The dynamic behavior depends on the transition functions
for the [illegible] and the parameter λ, which determines how far
a cell will move during each iteration. For small values of λ
the system changes slowly and in some circumstances can
become stuck, or "frozen" in a particular state. For moderate
values of λ the system will converge toward a low energy
state. For large values of λ, the motion of the cells is chaotic
and the system tends toward ever higher energy states.

FIG. 63 illustrates how the centroid, in this case the center
of gravity or "gravity point", is computed as the first step in
determining the location to which a cell is to be moved. The
center of gravity computation is illustrated for two cells, A
and B, that are at locations 322 and 324 respectively.

The location 322 of the cell A is represented in an
orthogonal system of x and y coordinates as x1,y1, whereas
the location 324 of the cell B is represented as x2,y2. The x
component of a location 326 of the center of gravity CG of
the cells A and B is computed as the average of the x
components of the locations 322 and 324, more specifically
as (x1+x2)/2. The y component of the center of gravity CG
is computed as the average of the y components of the
locations 322 and 324, more specifically as (y1+y2)/2.

Although the computation for only two cells is illustrated
in FIG. 63, it will be understood that the operation can be
generalized for a net comprising any number of cells.

The centroids and values of λS are generally computed as
continuous analog values in accordance with the invention.
These analog values can be used per se, or can alternatively
be rounded off to integer values corresponding to increments
of the spacing between adjacent cell locations such that each
new location corresponds exactly to a cell location
of the placement. In the latter case, the optimization will
tend to freeze at local fitness optima for values of λ less than
unity, but will [illegible] faster than in an application in
which the analog values are used for values of λ greater than
unity.

The basic method described above does not result in a
placement in which each location is occupied by a single
cell. Some locations can contain more than one cell, whereas
other locations can be vacant. This is because the method
does not take into account the fact that a newly computed
location may already be occupied by one or more cells.

For this reason, another operation is performed to distribute
the cells into the respective locations such that each
location is occupied by one cell.

Assuming an orthogonal x,y coordinate system, the cells
are first sorted in ascending order of their x coordinates. The

a chaotic state in which the results become non-optimally sorted cells are then equally divided into a number of 

random. groups, with the number of groups being equal to the 

For orthogonal cell placement arrangements, it has been 55 number of columns (extending in the y direction) of cell 

determined experimentally that an optimal solution can be locations, with each group being assigned to a respective 

achieved for values of X between 0 and 1.5, more preferably column. 

between 0.5 and 1.5. The cells in each group are then sorted in ascending order 

Hie present method of placement optimization can also be of their y coordinates, and distributed in this order into the 

viewed using the theory of Cellular Automata (CA). The 60 cel1 locations of the columns respectively. In this manner, 

placement is represented as a 2D lattice, with each cell the cells are distributed into locations that are substantially 

modelled by a finite-state automaton (FSA). Hie inputs to closest to the locations that the cells occupied upon comple- 

the FSA are the locations of neighboring cells and the tion of the basic chaotic placement method, 

locations of the cells to which the cell is connected through The steps of the present method can be practiced in 

the netlist. 65 various ways within the scope of the invention. For example, 

Each FSA consists of a cell state (current locauon), an the method can be performed using a single processor, such 

input alphabet (positions of neighboring cells), and a tran- that one cell is relocated during each incremental operation. 
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Preferably, however, a plurality of cells are relocated simul- 
taneously using parallel processors. 
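
The relocation step and the subsequent distribution of cells into columns can be illustrated by a minimal Python sketch. The function names (`centroid`, `relocate`, `legalize`), the parameter `lam` standing for λ, and the requirement that the number of cells equals `n_cols * n_rows` are assumptions made for illustration only and do not come from the disclosure.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def centroid(points: List[Point]) -> Point:
    """Average of the x and y components of equally weighted cells."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def relocate(cell: Point, others: List[Point], lam: float) -> Point:
    """Move `cell` toward the centroid of `others` by lam times the distance S."""
    cx, cy = centroid(others)
    return (cell[0] + lam * (cx - cell[0]), cell[1] + lam * (cy - cell[1]))

def legalize(cells: Dict[str, Point], n_cols: int, n_rows: int) -> Dict[str, Tuple[int, int]]:
    """Assign one cell per (column, row) slot.

    Cells are sorted by x, divided into n_cols equal groups, and each
    group is sorted by y and placed into its column in that order.
    Assumes len(cells) == n_cols * n_rows (illustrative restriction).
    """
    by_x = sorted(cells, key=lambda name: cells[name][0])
    slots: Dict[str, Tuple[int, int]] = {}
    for col in range(n_cols):
        group = by_x[col * n_rows:(col + 1) * n_rows]
        for row, name in enumerate(sorted(group, key=lambda nm: cells[nm][1])):
            slots[name] = (col, row)
    return slots
```

With `lam` below 1 the cell stops short of the centroid, with `lam` equal to 1 it lands on the centroid, and with `lam` above 1 it overshoots, matching the locations 318, 316 and 320 respectively.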

The new location for each cell can be made available 
immediately for computing the centroid of the net to which 
the cell is connected for the purpose of relocating the other
cells in the net. Alternatively, the initial locations can be 
used for relocating all of the cells in a net, for relocating an 
alternative grouping of cells, or for relocating all of the cells 
in the placement. 

Although the latter version of the method is not entirely
accurate as it can utilize cell locations that no longer exist, 
it is considerably faster than the former version since the 
number of computations is significantly reduced. 

These two alternatives are not mutually exclusive, and 
can be used in combination. For example, a subset of the
cells of the placement can be relocated without using their 
new locations in the centroid calculations. Then, the loca- 
tions can be updated and another subset of cells relocated 
using the new locations. 
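
The difference between the two update policies can be sketched as follows. The function `sweep`, its `immediate` flag, and the choice of an unweighted centroid over all distinct connected cells are assumptions for illustration; the disclosure does not prescribe this interface.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def sweep(positions: Dict[str, Point], nets: List[List[str]],
          lam: float, immediate: bool) -> Dict[str, Point]:
    """Relocate every cell once and return the updated positions dict.

    immediate=True : each new location is visible to later centroid
                     computations in the same sweep.
    immediate=False: all centroids are computed from the initial
                     locations (a frozen snapshot).
    """
    source = positions if immediate else dict(positions)
    for name in list(positions):
        linked = sorted({other for net in nets if name in net
                         for other in net if other != name})
        if not linked:
            continue  # unconnected cells stay where they are
        cx = sum(source[o][0] for o in linked) / len(linked)
        cy = sum(source[o][1] for o in linked) / len(linked)
        x, y = source[name]
        positions[name] = (x + lam * (cx - x), y + lam * (cy - y))
    return positions
```

For two connected cells at x=0 and x=4 with λ=0.5, the deferred policy moves both to x=2, whereas the immediate policy moves the first to x=2 and the second only to x=3, because the second cell sees the first cell's new location.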

Each iteration of the method can involve relocating a
single cell, a plurality of cells or all of the cells in the 
placement. Criteria by which individual cells or groups of 
cells can be selected for serial or parallel relocation include: 

1. The cells constituting each net can be relocated as a 
group.

2. The placement can be partitioned into units consisting 
of rows, columns or blocks of cells. 

3. Cells can be selected at random without replacement 
(each cell is randomly selected only once).

4. Cells can be selected at random with replacement (each 
cell can be selected once, more than once or not at all). 

5. Cells can be selected in an order that is random, but is the
same for each iteration. 
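
The three random selection criteria (items 3 to 5 above) can be sketched with Python's standard `random` module. The function names and the use of a seed to fix the order are illustrative assumptions.

```python
import random
from typing import List, Sequence

def without_replacement(cells: Sequence[str], rng: random.Random) -> List[str]:
    """Random order in which each cell is selected exactly once."""
    order = list(cells)
    rng.shuffle(order)
    return order

def with_replacement(cells: Sequence[str], rng: random.Random) -> List[str]:
    """Random draws; a cell may appear once, several times, or not at all."""
    pool = list(cells)
    return [rng.choice(pool) for _ in pool]

def fixed_random_order(cells: Sequence[str], seed: int) -> List[str]:
    """Random order that is identical for every iteration (same seed)."""
    order = list(cells)
    random.Random(seed).shuffle(order)
    return order
```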

The number of iterations by which the method is performed
can also be selected in accordance with a number of
criteria, including: 

1. The method can be performed a predetermined number 
of times (iterations). 

2. The fitness of the placement can be computed after each
iteration, preferably using the congestion based cost 
function methodology described above. The method is 
repeated until the fitness reaches a predetermined 
value. 

3. The method can be repeated until the iteration just
completed has not changed the fitness by more than a 
predetermined value (the operation has frozen in a 
particular state). 
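
The three termination criteria can be combined in a single driver loop, sketched below. The names `run_until_done`, `step` and `cost`, and the convention that fitness is expressed as a cost to be minimized, are assumptions made for this illustration.

```python
from typing import Callable, Optional, Tuple

def run_until_done(step: Callable[[], None], cost: Callable[[], float],
                   max_iters: int,
                   target_cost: Optional[float] = None,
                   min_change: Optional[float] = None) -> Tuple[int, str]:
    """Repeat `step` until one of the three criteria is met.

    Returns the number of iterations performed and the reason for stopping.
    """
    previous = cost()
    for iteration in range(1, max_iters + 1):
        step()
        current = cost()
        # Criterion 2: the fitness has reached a predetermined value.
        if target_cost is not None and current <= target_cost:
            return iteration, "target reached"
        # Criterion 3: the last iteration barely changed the fitness.
        if min_change is not None and abs(previous - current) <= min_change:
            return iteration, "frozen"
        previous = current
    # Criterion 1: a predetermined number of iterations was performed.
    return max_iters, "iteration limit"
```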

The present chaotic placement method can be enhanced 
by modifying the basic algorithm to include the effects of
cells in locations proximate to the initial location of a cell 
that is to be relocated, or to include the effects of all other 
cells in the placement. 

This spreads out clumps of cells so that the density of cells 
is more uniform throughout the placement. The attraction
between cells in the nets is balanced against repulsion 
caused by a high local cell density, providing an optimized 
tradeoff of wire length, feasibility and congestion. 

A first methodology for accomplishing this goal is illustrated
in FIG. 64, assuming that a cell X is initially placed
at a location 328. A cell density gradient is then computed 
for the cells in a predetermined pattern 330 proximate to the 
location 328, such as enclosed in a dashed line. The density 
gradient represents the local density of cells in the
placement.

The density gradient for each cell location includes a 
magnitude, and a direction of decreasing density as indicated
by a vector 332. Similar vectors are illustrated for the
other cell locations in FIG. 64. 

The magnitude of the density gradient at each cell loca- 
tion can be easily computed as being equal to the number of 
cells at the respective location. The decreasing density 
direction is computed using any of a number of known 
weighted or unweighted averaging functions, taking into 
account the cells in the pattern and their distance from the 
location 328. 

Using the modified method of FIG. 64, each cell is moved 
toward its calculated centroid by the distance λS, and also by
an offset corresponding to the computed density gradient. 
These two movements can be calculated and applied 
individually, or can be produced as a composite function 
resulting from the centroid computation and the density 
gradient computation. In either case, the movement corre- 
sponding to the density gradient is made in the decreasing 
density direction, and by a distance proportional or other- 
wise suitably related to the magnitude of the density. 

The centroid computation can also be offset by a function 
based on a simulated net force that is exerted on each cell by 
proximate cells, or by all of the other cells in the placement 
as illustrated in FIGS. 65 and 66. The net force is preferably 
a simulated electrostatic force based on the assumption that 
each cell is a charged particle having a unit electrostatic 
charge, although the invention is not so limited, and any 
suitable function can be utilized to offset the centroid 
computation based on the distribution of cells in the place- 
ment. 

FIG. 65 illustrates an exemplary subset of nine cell 
locations, including a central cell location 334 and eight cell 
locations A to H that surround the location 334. The simu- 
lated net repulsive force exerted on a cell in the location 334 
by the cells in the locations A to H is based on the inverse 
square law of electrostatics, such that the repulsive force F 
between two charged particles of the same electrostatic 
polarity is given as F=(Q1×Q2)/R², where Q1 and Q2 are the
electrostatic charges of the particles and R is the distance 
therebetween. 

The location 334 may be occupied by more than one cell. 
However, the method is preferably applied to each cell in the 
location 334 individually. Therefore, the net force is a 
function only of the cells in the locations A to H. The cells 
in the location 334 are considered as repelling each other. 

In the illustrated example, the locations A to H contain 
numbers of cells as follows: A=1; B=0; C=2; D=0; E=1;
F=3; G=1; H=0. The vacant locations B, D and H do not
have any effect on the cell or cells in the location 334. 
Alternatively, an empty location may exert an attractive 
force toward a cell. 

It will be assumed that each cell has a unit charge 
(Q1=Q2=1), and that the distance between orthogonally
adjacent cells is unity (R=1). The force between two cells in
orthogonally adjacent locations is therefore F=1/1=1.

The distance between two cells in diagonally adjacent 
locations is 1×2^(1/2). The force between two cells in diagonally
adjacent locations is therefore 1/(2^(1/2))² = 1/2. The magnitude
of each of the x and y components of this force is
1/(2×2^(1/2)) = 1/2.83 = 0.35.

FIG. 66 is a vector diagram illustrating the forces acting 
on a cell X in the location 334, in which the vectors are 
designated by the reference characters A to H corresponding
to the respective cell locations. 

The location A contains one cell. The force exerted on the 
cell X in the location 334 by this cell has an x component 
with a magnitude of 0.35 that acts rightwardly, and a y
component with a magnitude of 0.35 that acts downwardly 
as illustrated. 

The location C contains two cells, so that the force is
twice that of the single cell in the location A. The force of 
the two cells in the location C has an x component with a 
magnitude of 0.7 that acts leftwardly, and a y component 
with a magnitude of 0.7 that acts downwardly.

The location E contains one cell, and is orthogonally 
adjacent to (rightward of) the location 334. The force 
exerted by this cell has a magnitude of 1.0 and acts left- 
wardly. 

The location F contains three cells, so that the force is
three times that of the single cell in the location A. The force 
of the three cells in the location F has an x component with 
a magnitude of 1.05 that acts rightwardly, and a y component 
with a magnitude of 1.05 that acts upwardly. The location G 
contains one cell, and exerts a force with a magnitude of 1.0
in the upward direction. 

The resultant of these forces, or the net force exerted on 
the cell X in the location 334, is designated as a vector R, 
and has a magnitude of 1.09, and is displaced by an angle 
θ=16.7° counterclockwise from the positive y axis.
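
The vector sum of FIG. 66 can be checked numerically with a short sketch. The grid layout (A, B, C in the top row, D and E to the left and right, F, G, H in the bottom row) is inferred from the force directions given above; the names `OFFSETS`, `COUNTS`, `net_force` and `resultant` are assumptions for illustration.

```python
import math
from typing import Dict, Tuple

# Offsets (dx, dy) of each neighbour from location 334; +x is rightward, +y is upward.
OFFSETS: Dict[str, Tuple[int, int]] = {
    "A": (-1, 1), "B": (0, 1), "C": (1, 1),
    "D": (-1, 0),              "E": (1, 0),
    "F": (-1, -1), "G": (0, -1), "H": (1, -1),
}
COUNTS = {"A": 1, "B": 0, "C": 2, "D": 0, "E": 1, "F": 3, "G": 1, "H": 0}

def net_force(counts: Dict[str, int],
              offsets: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Sum of inverse-square repulsive forces on a unit-charge cell at the centre."""
    fx = fy = 0.0
    for name, n in counts.items():
        dx, dy = offsets[name]
        r = math.hypot(dx, dy)
        magnitude = n / r ** 2       # unit charges: F = (1 x n) / R^2
        fx -= magnitude * dx / r     # repulsion points away from the neighbour
        fy -= magnitude * dy / r
    return fx, fy

def resultant(fx: float, fy: float) -> Tuple[float, float]:
    """Magnitude and angle (degrees, counterclockwise from the positive y axis)."""
    return math.hypot(fx, fy), math.degrees(math.atan2(-fx, fy))
```

With unrounded components, this sketch gives a net force of about (-0.29, 1.00), that is, a magnitude near 1.04 at about 16.3° counterclockwise from the positive y axis. Summing the rounded 0.35 components quoted in the text gives (-0.30, 1.00) and an angle of about 16.7°, in agreement with the stated angle; the magnitude obtained either way is somewhat below the 1.09 quoted.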

The movement of the cell X from the location 334 is a 
combination of the movement computed using the centroid 
calculation, and a movement based on the net force vector 
R. The latter movement is made in the direction of the net 
force vector R, and by a distance that is proportional to the
magnitude of the vector R or computed in accordance with 
another suitable function of the magnitude of the vector R. 

Although the simplified example of FIGS. 65 and 66 
includes only eight cell locations that surround a single cell 
location, the invention preferably in actual practice computes
a net force based on a larger number of cell locations,
or all of the cell locations in the placement, using the same 
principle, or just cells in the neighborhood to reduce com- 
putational complexity. 

The invention is further not limited to the particular
functional computation that was described with reference to 
FIGS. 65 and 66. For example, the x and y force components 
can be computed as being inversely proportional to the 
distance between two locations, rather than inversely proportional
to the square of the distance. It is further within the
scope of the invention to calculate the offsets as functions of 
simulated attractive, rather than repulsive forces. 

Another function that can be utilized to calculate the x and 
y force components is given as: 



$$F_x = \sum_{i=1}^{n} \frac{dx_i}{\left[\max(|dx_i|,\,|dy_i|)\right]^3}, \qquad F_y = \sum_{i=1}^{n} \frac{dy_i}{\left[\max(|dx_i|,\,|dy_i|)\right]^3}$$

where Fx and Fy are the x and y net force components; n
is the number of cell locations that affect a cell to be
relocated; dx and dy are the x and y distances between
the location of the cell to be relocated and a cell for
which the force is being computed; and dx_i and dy_i are
the x and y components of the distances between the
location of the cell to be relocated and the cells in the
locations that affect the cell to be relocated. In the
denominator of the equations, the number that is cubed
is the maximum value of dx_i or dy_i, whichever is larger.

12. Distributed Shared Memory Implementations 
a. Single Chip Processor Node

FIGS. 67 to 70 illustrate a single integrated circuit chip 
DSM processor node 500 of the present invention. A plurality
of the nodes 500 can be interconnected to implement
the functionality of the DSM architecture 136 illustrated in 
FIG. 9, or the entire architecture 136 can be implemented in 
a single node 500. 

The node 500 comprises a computing unit 502 that 
includes, as shown in FIG. 68, a processor 504 and a cache 
memory 506. The node 500 further includes a main memory 
508, a memory controller 510 and an interconnect interface 
512. 

Further illustrated are an input-output (I/O) interface 514 
for connecting the node 500 to an I/O device or peripheral 
516 such as a keyboard, monitor, disk drive, video camera, 
A/D or D/A converter, framebuffer, printer, etc. An interconnect
controller 518 connects the node 500 to a remote
node 522 via a communications channel 520. The units 514 
and 518 are preferably integrated onto the same chip as the 
node 500, but can be separate therefrom within the scope of 
the invention. 

The processor 504 is selected to have a relatively simple 
functionality, such as Reduced Instruction Set Computer 
(RISC), and is therefore inexpensive to fabricate and occu- 
pies a small area on the chip. The invention is not so limited, 
however, and the processor 504 can be implemented by any 
suitable type of general purpose CPU, or a special purpose 
processor such as graphics, disk controller, Direct Memory 
Access (DMA) controller, etc. It is yet further within the 
scope of the invention to replace the processor 504 with a 
simple logic element such as a shift register. 

Although having simple functionality, the processor 504 
can still implement a full general purpose modern RISC
architecture with 32-bit, 64-bit, or greater addressing and 
virtual memory capability, which allows the node 500 to be 
used in the construction of very large machines for solving 
very large problems. 

The cache memory 506 is implemented in Static Random 
Access Memory (SRAM) to provide the required access 
speed, whereas the main memory 508 is implemented in low 
cost Dynamic Random Access Memory (DRAM). The 
memory controller 510 interconnects and maintains memory 
coherency between the processor 504, cache memory 506, 
and the memory in the remote node 522. It will be noted that 
although only a single computing unit 502 is illustrated in 
the drawing, the scope of the invention includes providing 
multiple computing units in the node 500 that share the main 
memory 508. 

The number of remote nodes 522 is similarly unrestricted. 
The nodes 522 can be similar to the node 500, or can be of 
different types as long as they are capable of communicating 
with the node 500 using the communications channel pro- 
tocol. The channel 520 can be serial and/or parallel, and 
include transceivers for electrical and/or optical intercon- 
nections. 

The memory controller 510 controls access to the cache 
memory 506 and the main memory 508. The interconnect 
interface 512 converts memory access instructions (read and 
write commands) from the processor 504 for accessing data 
stored in a memory (not shown) in the remote node 522 into 
memory access references or messages that are transmitted 
by the interconnect controller 518 to the remote node 522 
over the communications channel 520 in the form of data 
packets. 

In response, the remote node 522 performs the requested 
operation and sends a suitable message back to the node 500. 
For a read instruction, the message includes the requested 
data. For a write instruction, the message includes a block 
identifier and/or memory address for the data which was 
stored. 

The interconnect interface 512 performs the reverse
operations in response to memory access messages received 
from the remote node 522. In response to a read message, the 
memory controller 510 retrieves the requested data from the 
cache memory 506 or the main memory 508, and the 
interface 512 sends a message including the data to the 
remote node 522. In response to a write message, the 
memory controller 510 stores the included data in the cache 
memory 506 or the main memory 508, and sends a message 
to the remote node including a block identifier and/or
memory address for the data that was stored. 
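
The read and write message exchange described above can be sketched as a pair of small Python classes. The names `Message` and `Node`, the message kinds "read", "write", "data" and "ack", and the use of a dictionary as the node's memory are hypothetical and serve only to illustrate the request and reply pattern; packetization and coherence are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Message:
    kind: str                     # "read", "write", "data" or "ack" (illustrative)
    address: int
    data: Optional[bytes] = None

@dataclass
class Node:
    memory: Dict[int, bytes] = field(default_factory=dict)

    def serve(self, msg: Message) -> Message:
        """Handle a memory access message received from a remote node."""
        if msg.kind == "read":
            # Reply with the requested data.
            return Message("data", msg.address, self.memory.get(msg.address))
        if msg.kind == "write":
            # Store the included data and reply with the address that was written.
            self.memory[msg.address] = msg.data
            return Message("ack", msg.address)
        raise ValueError(f"unsupported message kind: {msg.kind}")
```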

It will be noted that the present invention is not limited to 
the particular illustrated configuration. For example, the 
node 500 can include only a single cache coherent memory, 
or more than two cache coherent memories. As yet another
alternative, the interconnect interface 512 can be modified to 
provide communication in only one direction, such as in a 
ring network arrangement (not shown). 

As discussed above, a conventional multi-chip DSM
architecture is too large to be implemented on a single
integrated circuit chip. For example, as will be described in
detail below, a typical multi-chip DSM architecture requires
approximately 1,692 mm² of chip area, which is much larger
than the 256 mm² area of a conventional 16 mm×16 mm
chip.

An important principle of the invention is that, with a 
DSM node 500 implemented on a single integrated circuit 
chip as presently disclosed, the capacity of the cache 
memory 506 can be reduced sufficiently to enable the cache 
memory 506 and other elements of the DSM node 500 to fit
on the chip without reducing the processing speed of the 
node 500. 

Although reducing the capacity of the cache memory 506 
increases the cache miss rate, the reduced latency provided 
by integrating the processor 504 and main memory 508 on
a single chip reduces the cache miss resolution time or cost 
to an extent that compensates for the increased cache miss 
rate. 

In addition, the RISC processor 504 is substantially 
smaller than a more complicated processor that would be
required to provide the same processing speed in a multi- 
chip DSM implementation, thereby enabling the processor 
504 to fit on the chip with the other elements. 

The smaller and less expensive processor 504 also
increases the number of processors (only one processor 504
is shown) that can be connected to a main memory 508 of
predetermined capacity. This increases the number of processors
that can simultaneously operate on a problem
defined by the main memory space and thereby increases the
computational efficiency and also reduces the amount of
main memory that is required for each processor. The ability
of the present DSM node 500 to be implemented on a single
integrated circuit chip is also enhanced.

More specifically, tens to hundreds of megabytes of main
memory are currently used per processor. This ratio balances
the cost of processor and memory and is also required to
supply enough memory bandwidth for the processor. The
high bandwidth available from the present on-chip main
memory 508 and the reduced cost of the processor 504 both
support a reduction in the amount of main memory per
processor. This reduction in the amount of main memory
makes it feasible to include the main memory 508 on the
same chip as the processor 504.

Using the principles of this invention, as the semiconductor
technology continues to advance, multiple DSM nodes
500 can be integrated on a single chip. Because of the
increasing signal propagation delay issues with advanced
semiconductor technology, the "small and simple" approach
to processor, cache, and main memory design will continue
to have advantages over the conventional approach.

The unique manner in which the present invention overcomes
the problems of the prior art and enables the DSM
node 500 to be implemented on a single integrated circuit
chip will become more apparent from the following
example.

EXAMPLE

An integrated circuit fabrication process is assumed as
having the following characteristics.

Type              CMOS
Feature size      0.5 micron
Logic density     2,500 gates/mm²
SRAM density      2 KB/mm²
DRAM density      32 KB/mm²
Chip area         256 mm²

The characteristics and chip areas for the present single-chip
node 500 and a conventional multi-chip node having
comparable performance are given below.

ITEM                    MULTI-CHIP      SINGLE-CHIP

1. Processor
   Logic                250 K gates     80 K gates
   Cache memory         1 MB            32 KB
   Main memory          32 MB           4 MB
   Clock speed          200 MHz         300 MHz
   Clocks/instruction   0.8             1.2
   Cache miss cost      400 ns          60 ns
   Cache miss rate      1.5%            4.3%
   Processing speed     85 MIPS         144 MIPS
2. DSM logic            200 K gates     200 K gates
3. Total logic          450 K gates     280 K gates
4. Chip Area
   Logic                180 mm²         112 mm²
   Cache memory         512 mm²         16 mm²
   Main memory          1,000 mm²       125 mm²
   Total                1,692 mm²       253 mm²

It will be understood from the above that the present
invention enables the capacity or size of the cache memory
506 to be reduced from 1 megabyte for a conventional
multi-chip DSM implementation to 32 kilobytes for the
present node 500. This reduces the chip area of the cache
memory 506 from 512 mm² for the multi-chip configuration
to 16 mm² for the present node 500. Although the cache miss
rate is increased from 1.5% to 4.3%, the cache miss cost is
reduced from 400 ns to 60 ns, thereby more than compensating
for the increased cache miss rate.

The capacity of the main memory 508 can be reduced
from 32 megabytes to 4 megabytes, thereby reducing the
size of the main memory 508 from 1,000 mm² to 125 mm².
Even assuming that the same DSM logic is used, the total
logic requirement is reduced from 450 K gates to 280 K
gates, reducing the logic area of the chip from 180 mm² to
112 mm².

As a result of the invention, the total chip area is reduced
from 1,692 mm² for a conventional DSM multi-chip architecture
to 253 mm² for the present node 500, enabling the
node 500 to be integrated onto the 256 mm² area of a
standard 16 mm×16 mm chip.

In addition to the substantial size and cost reduction and
advantageous single-chip implementation of the present 
node 500, the processing speed thereof is increased by 69% 
from 85 MIPS to 144 MIPS over the prior art arrangement. 
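
The arithmetic behind the EXAMPLE can be reproduced with a short sketch. The helper names are illustrative, and two assumptions are made to match the tabulated figures: the cache capacities are taken as 1 MB = 1,024 KB, whereas the main memory capacities are taken as 1 MB = 1,000 KB. The product of miss rate and miss cost is used here as a rough average miss penalty per cache access, which is one way of reading the statement that the lower miss cost more than compensates for the higher miss rate.

```python
# Process characteristics quoted in the EXAMPLE.
GATES_PER_MM2 = 2500
SRAM_KB_PER_MM2 = 2
DRAM_KB_PER_MM2 = 32

def chip_area(k_gates: float, cache_kb: float, main_kb: float):
    """Return (logic, cache, main, total) areas in mm^2."""
    logic = k_gates * 1000 / GATES_PER_MM2
    cache = cache_kb / SRAM_KB_PER_MM2
    main = main_kb / DRAM_KB_PER_MM2
    return logic, cache, main, logic + cache + main

def miss_penalty_ns(miss_rate: float, miss_cost_ns: float) -> float:
    """Average miss penalty per cache access (rate times cost)."""
    return miss_rate * miss_cost_ns

multi_chip = chip_area(450, 1024, 32000)    # 1 MB cache, 32 MB main memory
single_chip = chip_area(280, 32, 4000)      # 32 KB cache, 4 MB main memory
speed_gain_percent = (144 / 85 - 1) * 100   # 85 MIPS -> 144 MIPS
```

Under these assumptions the sketch yields 1,692 mm² and 253 mm² for the two totals, an average miss penalty of about 6 ns versus about 2.6 ns, and a speed gain of about 69%.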

Referring again to FIG. 68, the computing unit 502 further 
comprises a floating point unit 524 that functions integrally 
with the processor 504 for performing non-integer arith- 
metic operations. 

The processor 504 operates using virtual addresses. A 
memory management unit (MMU) 526 maps these virtual 
addresses to the local physical addresses of the node 500. A 
cache controller 528 maintains cache coherence between the 
cache memory 506 and any other cache memories that are 
connected to the processor 504 via a virtual address bus 530 
using a conventional snooping or other scheme. 

A processor bus interface 532 connects the computing 
unit 502 to the memory controller 510 and I/O interface 514 
via a processor bus 534. The interface 532 passes data 
between the virtual address bus 530, cache controller 528, 
MMU 526 and processor bus 534 using local physical 
addresses. The invention can also be implemented with a 
processor bus using virtual addresses or combined virtual 
and physical addresses. 

As illustrated in FIG. 69, the memory controller 510 
comprises a processor bus interface 536 for connection to 
the processor bus 534 and a DRAM controller 538 for 
controlling access to the main memory 508. The controller 
510 further includes a directory controller 540 that stores 
and modifies a directory in the main memory 508. It will be 
noted, however, that the invention is not so limited, and that 
the directory can be stored in a dedicated memory (not 
shown) in the controller 540. 

The directory is typically two dimensional, including a 
first dimension that represents the memory elements (cache 
and main memory) of all memories in the node 500 and 
remote nodes connected thereto, and a second dimension 
that represents data created by the system as divided into 
blocks of fixed size. 

An entry is made in the directory for each memory 
element that stores a particular block of data, and the status 
of the data (uncached, shared, dirty, etc.). If data in the node 
500 is modified, the directory controller 540 sends messages 
to all other memory elements in the system that contain 
copies of the modified data, causing the obsolete copies to 
be updated or invalidated. 
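
The directory bookkeeping described above can be sketched as follows. The class name `Directory`, the status strings, and the choice to invalidate (rather than update) obsolete copies are assumptions for illustration; the disclosure permits either updating or invalidating.

```python
from typing import Dict, List

class Directory:
    """Maps each data block to the memory elements holding a copy and their status."""

    def __init__(self) -> None:
        self.entries: Dict[int, Dict[str, str]] = {}

    def record(self, block: int, element: str, status: str = "shared") -> None:
        """Note that `element` stores a copy of `block`."""
        self.entries.setdefault(block, {})[element] = status

    def modify(self, block: int, writer: str) -> List[str]:
        """Mark the writer's copy dirty and return the elements to be invalidated."""
        holders = self.entries.setdefault(block, {})
        stale = [element for element in holders if element != writer]
        for element in stale:
            del holders[element]      # obsolete copies are dropped from the directory
        holders[writer] = "dirty"
        return stale
```

The list returned by `modify` corresponds to the memory elements to which invalidation messages would be sent.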

As illustrated in FIG. 70, the interconnect interface 512 
includes a global memory management unit (GMMU) 542 
for converting the local physical addresses that are used 
internally by the node 500 into global physical addresses 
that are used by the interconnect controller 518 for trans- 
mitting data over the communications channel 520. The 
GMMU 542 also provides access control to regions of 
memory, and sets attributes for each region in accordance 
with a memory model. 

A remote memory access unit 544 converts memory 
access instructions for accessing remote memory into 
memory access references or messages, and a memory 
reference message packet assembly 546 assembles the mes- 
sages into packets for transmission over the channel 520 as 
described above. 

A memory reference message packet disassembly 548 
similarly disassembles memory access references or mes- 
sages that are received over the channel 520, whereas a 
remote request server 550 converts the memory access 
messages into memory access instructions. 

The interconnect controller 518 is preferably imple- 
mented by a communications protocol interface unit and 
router such as described in a technical disclosure entitled
"The S3.mp Interconnect System and TIC chip", by A. 
Nowatzyk, Proceedings of IEEE Computer Society HOT 
Interconnect Symposium, Stanford University, 1993. 

b. Single Chip Communications Node

FIG. 71 illustrates a single integrated circuit communi- 
cations node 600 for connecting an I/O device or peripheral 
602 and associated local memory 604 to one or more remote 
nodes 606. The peripheral 602 can be a CRT monitor, video 
camera or any other suitable device. An especially desirable 
application for the node 600 is for simultaneous video 
teleconferencing in which two or more video camera/ 
monitor units are interconnected by a network. 

Although prior art networks such as Ethernet, Token Ring, 
DECNet and RS-232 are capable of providing this function,
they are relatively slow and not scalable. In addition, they
require an expensive network interface adaptor for each 
device that is connected to the network. 

The present node 600 is scalable, can be fabricated very 
inexpensively on a single integrated circuit chip, and is
faster in operation than conventional networks. This is
because all transmissions consist of memory access refer- 
ences or messages in packet or cell form, and all memories 
connected to the system are maintained coherent. 

The node 600 consists of a memory controller 608 and an
interconnect interface 610 that are constructed and operate
in the manner described above with reference to the ele- 
ments 510 and 512 of the DSM node 500 respectively. The 
node 600 is therefore a subcombination of the node 500. The 
node 600 does not necessarily include a processor, although
a processor can be added, because the node 600 is typically
controlled remotely by a processor in a full DSM node 500
or, alternatively, by a processor (not shown) in the peripheral
602. The local memory 604 can be provided on a separate
chip, or more preferably, integrated onto the same chip as the
node 600.

The interconnect interface 610 is shown as being con- 
nected through a unidirectional or bidirectional channel 
control 612 and a communications channel 614 to the 
remote node 606. The control 612 differs from the interconnect
controller 518 of the DSM node 500 in that it provides
protocol interface only, without routing. This enables point 
to point communications between two nodes. However, the 
interconnect controller 518 can be substituted for the channel
control 612 if connection and routing to a plurality of
nodes is desired.

In summary, the present invention provides a process 
optimization method that is capable of solving extremely 
large problems including massive numbers of interrelated 
variables, and a parallel processing architectural structure
for implementing the method. Various modifications will
become possible for those skilled in the art after receiving 
the teachings of the present disclosure without departing 
from the scope thereof. 
We claim: 

1. A physical design automation system for producing a
highest fitness cell placement for an integrated circuit chip, 
comprising: 

a processor for altering a population of cell placements 
using a predetermined first fitness improvement algorithm
or a predetermined second fitness improvement
algorithm; and 
a controller for controlling the processor to select and use 
said first fitness improvement algorithm or said second 
fitness improvement algorithm in accordance with a 
predetermined optimization criterion.

2. A system as in claim 1, in which said first fitness 
improvement algorithm comprises genetic evolution. 
3. A system as in claim 2, in which said second fitness
improvement algorithm comprises simulated annealing.

4. A system as in claim 2, in which said second fitness
improvement algorithm comprises genetic mutation.

5. A system as in claim 1, in which the controller controls
the processor to initially select and use said first fitness
improvement algorithm, and to subsequently select and use
said second fitness improvement algorithm in accordance
with said optimization criterion.

6. A system as in claim 5, in which said first fitness
improvement algorithm comprises genetic evolution.

7. A system as in claim 6, in which said second fitness
improvement algorithm comprises simulated annealing.

8. A system as in claim 6, in which said second fitness
improvement algorithm comprises genetic mutation.

9. A system as in claim 1, further comprising:

a fitness computer for computing fitnesses of said cell
placements, in which:

said optimization criterion comprises a predetermined
function of said fitnesses.

10. A system as in claim 9, in which the fitness computer
computes said fitnesses as a predetermined function of cell
interconnect congestion in said cell placements.

11. A system as in claim 9, in which the controller controls
the processor to repeatedly alter said cell placements until a
predetermined termination criterion is reached.

12. A system as in claim 11, in which the controller selects
a cell placement from said population that has a highest
fitness value as said highest fitness cell placement after said
termination criterion is reached.

13. A system as in claim 11, in which the controller
controls the processor to initially select and use said first
fitness improvement algorithm, and to select and use said
second fitness improvement algorithm after the processor
has altered said cell placements a predetermined number of
times.

14. A system as in claim 11, in which the controller
controls the processor to initially select and use said first
fitness improvement algorithm, and to select and use said
second fitness improvement algorithm after a fitness value of
a placement having a highest fitness has reached a
predetermined value.

15. A system as in claim 11, in which the controller
controls the processor to initially select and use said first
fitness improvement algorithm, and to select and use said
second fitness improvement algorithm after the processor
has altered said cell placements a predetermined number of
times without producing a change in a fitness value of a
placement having a highest fitness value.

16. A system as in claim 11, in which the controller
controls the processor to initially select and use said first
fitness improvement algorithm, and to select and use said
second fitness improvement algorithm after the processor
has altered said cell placements and a fitness value of a
placement having a highest fitness value has changed by less
than a predetermined value.

17. A method of increasing a fitness of a permutation of
a predetermined number of entities, comprising the steps of:

(a) altering said permutation using a predetermined first
fitness improvement algorithm or a predetermined second
fitness improvement algorithm; and

(b) selecting said first fitness improvement algorithm or
said second fitness improvement algorithm in accordance
with a predetermined optimization criterion.

18. A method of increasing a fitness of a cell placement for
an integrated circuit chip, comprising the steps of:

(a) altering a population of cell placements using a
predetermined first fitness improvement algorithm or a
predetermined second fitness improvement algorithm; and

(b) controlling step (a) to select and use said first fitness
improvement algorithm or said second fitness improvement
algorithm in accordance with a predetermined
optimization criterion.

19. A method as in claim 18, in which said first fitness
improvement algorithm in step (a) comprises genetic
evolution.

20. A method as in claim 19, in which said second fitness
improvement algorithm in step (a) comprises simulated
annealing.

21. A method as in claim 19, in which said second fitness
improvement algorithm in step (a) comprises genetic
mutation.

22. A method as in claim 18, in which step (b) comprises
controlling step (a) to initially select and use said first fitness
improvement algorithm, and to subsequently select and use
said second fitness improvement algorithm in accordance
with said optimization criterion.

23. A method as in claim 22, in which said first fitness
improvement algorithm in step (a) comprises genetic
evolution.

24. A method as in claim 23, in which said second fitness
improvement algorithm in step (a) comprises simulated
annealing.

25. A method as in claim 23, in which said second fitness
improvement algorithm in step (a) comprises genetic
mutation.

26. A method as in claim 18, further comprising the step
of:

(c) computing fitnesses of said cell placements; in which

step (b) comprises computing said optimization criterion
as including a predetermined function of said fitnesses.

27. A method as in claim 26, in which step (c) comprises
computing said fitnesses as a predetermined function of cell
interconnect congestion in said cell placements.

28. A method as in claim 26, in which step (b) further
comprises controlling step (a) to repeatedly alter said cell
placements until a predetermined termination criterion is
reached.

29. A method as in claim 28, further comprising the step
of:

(d) selecting a cell placement from said population that
has a highest fitness value as said highest fitness cell
placement after said termination criterion is reached.

30. A method as in claim 28, in which step (b) comprises
controlling step (a) to initially select and use said first fitness
improvement algorithm, and to subsequently select and use
said second fitness improvement algorithm after said cell
placements have been altered a predetermined number of
times.

31. A method as in claim 28, in which step (b) comprises
controlling step (a) to initially select and use said first fitness
improvement algorithm, and to subsequently select and use
said second fitness improvement algorithm after a fitness
value of a placement having a highest fitness has reached a
predetermined value.

32. A method as in claim 28, in which step (b) comprises
controlling step (a) to initially select and use said first fitness
improvement algorithm, and to subsequently select and use
said second fitness improvement algorithm after said cell
placements have been altered a predetermined number of
times without producing a change in a fitness value of a
placement having a highest fitness value.

33. A method as in claim 28, in which step (b) comprises 
controlling step (a) to initially select and use said first fitness 
improvement algorithm, and to subsequently select and use 
said second fitness improvement algorithm after said cell
placements have been altered and a fitness value of a 
placement having a highest fitness value has changed by less 
than a predetermined value. 

* * * * * 






